Parallelized calculation of permutation tests.

Markus Ekvall¹, Michael Höhle², Lukas Käll¹.

Abstract

MOTIVATION: Permutation tests offer a straightforward framework to assess the significance of differences in sample statistics. A significant advantage of permutation tests is that relatively few assumptions about the distribution of the test statistic are needed, as they rely on the assumption of exchangeability of the group labels. They have great value, as they allow a sensitivity analysis to determine the extent to which the assumed broad sample distribution of the test statistic applies. However, in this situation, permutation tests are rarely applied because the running time of naïve implementations is too long and grows exponentially with the sample size. Nevertheless, continued development in the 1980s introduced dynamic programming algorithms that compute exact permutation tests in polynomial time. Despite this significant reduction in running time, the exact test has not yet become one of the predominant statistical tests for medium sample sizes. Here, we propose a computational parallelization of one such dynamic programming-based permutation test, the Green algorithm, which makes the permutation test more attractive.
RESULTS: Parallelization of the Green algorithm was found possible by non-trivial rearrangement of the structure of the algorithm. A speed-up by orders of magnitude is achievable by executing the parallelized algorithm on a GPU. We demonstrate that the execution time essentially becomes a non-issue, even for sample sizes as high as hundreds of samples. This improvement makes our method an attractive alternative to, e.g. the widely used asymptotic Mann-Whitney U-test.
AVAILABILITY AND IMPLEMENTATION: In Python 3 code from the GitHub repository https://github.com/statisticalbiotechnology/parallelPermutationTest under an Apache 2.0 license.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 33289531 PMCID: PMC8016463 DOI: 10.1093/bioinformatics/btaa1007

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Permutation tests are frequently used for non-parametric testing and are incredibly valuable within computational biology, with applications within genome-wide association studies (Browning, 2008; Dudbridge and Gusnanto, 2008; Purcell et al., 2007), pathway analysis (Jeuken and Käll, 2018; Subramanian et al., 2005) and expression quantitative trait loci studies (Doerge and Churchill, 1996; Sul et al., 2015). There are roughly two ways to implement a permutation test: Monte Carlo-based sampling techniques (Segal et al., 2018) and exact tests that derive the full permutation distribution. Exact tests are traditionally seen as unattractive for large sample sizes, as the number of permutations grows super-exponentially with the sample size. Nonetheless, Green’s dynamic programming algorithm (Green, 1977), which was made explicit by others (Pagano and Tritchler, 1983; Zimmermann, 1985), partially overcomes this computational problem. This algorithm is significantly less computationally demanding than the naïve approach. However, the exact test for larger sample sizes has not attracted much attention in the last couple of decades. We report here on an extension that uses a graphics processing unit (GPU) implementation to compute parallelized exact tests, and we found it superior to the other tested alternatives in terms of speed and accuracy.

2 Algorithm

Here, we describe our parallelization of the Green method to calculate exact tests. First, in Section 2.1, we describe the main objective, performing hypothesis testing with an exact test. We then describe the Green algorithm in Section 2.2 and, finally, how to parallelize the algorithm in Section 2.3.

2.1 Hypothesis testing

Consider a two-sample hypothesis testing setting for central tendency, i.e. we want to know if the considered response in one group B is generally larger than in another group A. Let $\mathbf{x}_A = (x_{A,1}, \ldots, x_{A,m})$ and $\mathbf{x}_B = (x_{B,1}, \ldots, x_{B,n})$ be two independent samples of non-negative integer-valued scores from the distributions $F_A$ and $F_B$, respectively. So, $x_{A,i} \in \mathbb{N}_0$ and $x_{A,i} \sim F_A$ for $i = 1, \ldots, m$; and $x_{B,i} \in \mathbb{N}_0$ and $x_{B,i} \sim F_B$ for $i = 1, \ldots, n$. We also form the concatenation of the samples as $\mathbf{z} = (\mathbf{x}_A, \mathbf{x}_B)$. Our interest is in investigating the hypothesis $H_0: F_A = F_B$ versus the one-sided alternative that the response in group B tends to be larger than the response in group A. This alternative can be formalized for our case of discrete distributions by, e.g. introducing $\phi = P(X_A < X_B) + \frac{1}{2} P(X_A = X_B)$ and letting $H_1: \phi > \frac{1}{2}$ (Fay and Proschan, 2010). A way to perform the test is to determine how extreme the observed sum $s$ of the first sample is under the null hypothesis. Typically, one assumes a particular parametric family of distributions and computes the P value of the test as the probability of observing $s$ or a more extreme value in the alternative’s direction under the assumption that the null hypothesis is true. However, it is rarely possible to analytically compute this probability, and one often has to resort to asymptotic approximations. A permutation test approach to the testing problem is to assume instead that the labels A and B are exchangeable under $H_0$: under the null hypothesis, the distribution of the test statistic would remain the same for any permutation of $\mathbf{z}$. However, this property would not hold under the alternative $H_1$ (Huang et al., 2006). One can thus calculate how frequently samples with sample sums greater than or equal to $s$ appear when resampling from $\mathbf{z}$. We can formulate the P value as $p = P(S \geq s) = \sum_{s' \geq s} P(S = s')$, where $P(S = s')$ is the probability mass function, and $S$ is a random variable denoting the sum’s value in the first sample under the permutation distribution.

The computationally expensive part is to obtain $P(S = s)$, which is obtained by concatenating $\mathbf{x}_A$ and $\mathbf{x}_B$ into $\mathbf{z}$, drawing all possible subsets of length $m$ from $\mathbf{z}$, and counting the number of occurrences of each possible sum; when the numbers of occurrences of all possible sums are available, the distribution is accessible. Assume a random variable $X_j$ that is a randomly sampled subset of $\mathbf{z}$ with $j$ elements, with corresponding sum $S_j = \sum_{x \in X_j} x$, where $0 \leq j \leq m + n$. Define $N(s, m)$ to be the number of ways we can sample subsets with $m$ elements in such a way that their sum $S = s$. Now $P(S = s)$ can be expressed as the ratio of the number of ways that a subset can be sampled so that its sum is $S = s$ to the number of ways it can be sampled with any sum,

$$P(S = s) = \frac{N(s, m)}{\sum_{s'} N(s', m)}. \qquad (1)$$

Combinatorics gives us that the denominator of Equation 1 can be expressed as

$$\sum_{s'} N(s', m) = \binom{m + n}{m}. \qquad (2)$$

The calculation of the numerator is intricate. A naïve algorithm that would exhaustively calculate the sum for each possible subset of size $m$ and compare it to $s$, for all $s$, would need $\binom{m+n}{m}$ calculations, which becomes computationally prohibitive even for moderate set sizes $m$. However, in Section 2.2, we will discuss an algorithm solving the problem within polynomial time. Now, the sought P value can be calculated as

$$p = P(S \geq s) = \frac{\sum_{s' \geq s} N(s', m)}{\binom{m + n}{m}}. \qquad (3)$$

Alternatively, the same framework can be used to calculate the mid P value (Routledge, 1994) as

$$p_{\mathrm{mid}} = \frac{1}{2} P(S = s) + \sum_{s' > s} P(S = s').$$

As mentioned above, there is no closed formula to calculate $N(s, m)$. However, it is possible to develop a dynamic programming algorithm to obtain $N(s, m)$ (Pagano and Tritchler, 1983; Zimmermann, 1985), described thoroughly in the next section.
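The exhaustive procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function name `naive_exact_p_values` is hypothetical, and the sketch assumes that the test statistic is the sum of the first sample, as in the description of S. It enumerates every subset of size m, so it is only usable for very small samples.

```python
from itertools import combinations
from math import comb


def naive_exact_p_values(x_a, x_b):
    """Exhaustive one-sided exact test on the sum of the first sample.

    Hypothetical helper for illustration; returns (P value, mid P value).
    """
    z = list(x_a) + list(x_b)   # concatenated sample
    m = len(x_a)
    s_obs = sum(x_a)            # observed sum of the first sample

    # counts[s] = number of size-m subsets of z whose elements sum to s
    counts = {}
    for idx in combinations(range(len(z)), m):
        s = sum(z[i] for i in idx)
        counts[s] = counts.get(s, 0) + 1

    total = comb(len(z), m)     # number of size-m subsets
    greater = sum(c for s, c in counts.items() if s > s_obs)
    equal = counts.get(s_obs, 0)

    p_value = (greater + equal) / total
    mid_p = (greater + 0.5 * equal) / total
    return p_value, mid_p


p, mid_p = naive_exact_p_values([5, 7, 9], [1, 2, 3, 4])
print(p, mid_p)
```

In this toy input only one of the 35 possible subsets of size three reaches the observed sum of 21, so the P value is 1/35 and the mid P value is half of that.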

2.2 Efficient calculation of $N(s, m)$: the Green algorithm

A dynamic programming algorithm for calculating $N(s, m)$ was first presented in Green (1977) and described in more detail by others (Gebhard and Schmitz, 1998; Zimmermann, 1985). Here, we give a walk-through of the algorithm, mostly to prepare for the description of its parallelization in Section 2.3. We also provide a simple use case of the algorithm in Supplementary Note S1.

We can find a recursive expression for $N(s, m)$ by considering a scenario where the sample is drawn not from the full set $\mathbf{z}$ but from a subset $\mathbf{z}_{1:i}$ consisting of only the first $i$ elements of $\mathbf{z}$. To do so, we first need some definitions. Define $N_i(s, j)$ as the number of ways we can sample $j$ elements from the subset $\mathbf{z}_{1:i}$ so that their sum becomes $s$. If we know how to calculate $N_i(s, j)$, we also know how to calculate $N(s, m)$, since $N(s, m) = N_{m+n}(s, m)$. We also define $M_i(s, j)$ to be the number of ways we can sample $j$ elements from $\mathbf{z}_{1:i}$ so that the sampled elements sum to $s$ and include the last element $z_i$. We can now form a recursion, as the number of ways to sample from $\mathbf{z}_{1:i}$ has to be equal to the number of ways to sample from $\mathbf{z}_{1:i-1}$ plus the number of ways to sample from $\mathbf{z}_{1:i}$ that include $z_i$. So,

$$N_i(s, j) = N_{i-1}(s, j) + M_i(s, j). \qquad (4)$$

We can express $M$ in terms of $N$ by noting that $M_i(s, j) = N_{i-1}(s - z_i, j - 1)$ when $s \geq z_i$ and $j \geq 1$, and otherwise zero. We can hence express the recursion 4 as

$$N_i(s, j) = N_{i-1}(s, j) + N_{i-1}(s - z_i, j - 1). \qquad (5)$$

Let us now turn to the boundary conditions of this recursion. The empty set trivially reaches the sum $s = 0$, thus

$$N_i(0, 0) = 1. \qquad (6)$$

We cannot sample $j$ from $i$ elements if $j > i$. Also, $N_i(s, j)$ has to be zero when either $j = 0$ and $s > 0$ (i.e. the empty sample cannot reach a positive sum), when $s < 0$, or when $s > s_{\max}$, where $s_{\max}$ denotes the largest attainable sum (i.e. when $s$ is outside the boundary of possible sums). Hence,

$$N_i(s, j) = 0 \quad \text{if } j > i, \text{ or } (j = 0 \text{ and } s > 0), \text{ or } s < 0, \text{ or } s > s_{\max}. \qquad (7)$$

By combining the base cases 7 and 6 with the generic sub-recursion 5, the final recursion is

$$N_i(s, j) = \begin{cases} 1 & \text{if } s = 0 \text{ and } j = 0, \\ 0 & \text{if } j > i, \text{ or } (j = 0 \text{ and } s > 0), \text{ or } s < 0, \text{ or } s > s_{\max}, \\ N_{i-1}(s, j) + N_{i-1}(s - z_i, j - 1) & \text{otherwise.} \end{cases} \qquad (8)$$

The pseudo-code of the top-down dynamic programming code of the recursion in Equation 8 is given in Supplementary Algorithm S1. The algorithm needs some explanation. First, instead of using one extensive array holding $N_i(s, j)$ for all $i$, two smaller arrays, $N_{\mathrm{old}}$ and $N_{\mathrm{new}}$, are swapped and rewritten in each iteration of $i$ in an oscillatory fashion to save memory, see line 24. It is the complete rewriting of $N_{\mathrm{new}}$ in the next iteration that makes this possible (any of the conditions in the recursion relation will re-calculate each entry of $N_{\mathrm{new}}$). We save quite some memory by keeping $N_{\mathrm{old}}$ and $N_{\mathrm{new}}$ instead of the full three-dimensional array. The former two arrays require $O(m \, s_{\max})$ memory, whereas the latter array requires $O((m + n) \, m \, s_{\max})$. Moreover, this improvement in memory usage makes an essential difference for the parallelized algorithm (a GPU’s memory can easily be a bottleneck). Second, notice the structure of the for-loops: they could have been arranged in any order and still give correct computations, provided that the dimensions of the two arrays are adjusted to match the outer-most loop. Nevertheless, this specific order has a purpose: it is parallelizable, as described in the next section. One final note on Supplementary Algorithm S1: by plain observation of the nested loops, it is easy to see that the running time is $O((m + n) \, m \, s_{\max})$.
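As a rough illustration of the recursion and of the two swapped arrays, the following Python sketch counts the subsets bottom-up. It is not a transcription of Supplementary Algorithm S1: the names `green_counts`, `exact_p_value`, `n_old` and `n_new` are chosen here for illustration, the array layout is an assumption, and the test statistic is assumed to be the sum of the first sample.

```python
from math import comb


def green_counts(z, m):
    """Return counts, where counts[s] is the number of size-m subsets of z
    (non-negative integers) whose elements sum to s."""
    s_max = sum(sorted(z)[-m:]) if m > 0 else 0  # largest attainable sum
    # n_old[j][s]: ways to pick j of the elements seen so far with sum s
    n_old = [[0] * (s_max + 1) for _ in range(m + 1)]
    n_old[0][0] = 1                              # empty selection sums to zero
    n_new = [[0] * (s_max + 1) for _ in range(m + 1)]

    for z_i in z:                                # outer loop over i
        for j in range(m + 1):
            for s in range(s_max + 1):
                ways = n_old[j][s]               # subsets that skip z_i
                if j >= 1 and s >= z_i:
                    ways += n_old[j - 1][s - z_i]  # subsets that include z_i
                n_new[j][s] = ways               # every entry is overwritten
        n_old, n_new = n_new, n_old              # swap the roles of the arrays
    return n_old[m]


def exact_p_value(x_a, x_b):
    """One-sided exact P value P(S >= s) for the sum of the first sample."""
    z = list(x_a) + list(x_b)
    m = len(x_a)
    counts = green_counts(z, m)
    return sum(counts[sum(x_a):]) / comb(len(z), m)


print(exact_p_value([5, 7, 9], [1, 2, 3, 4]))
```

Python integers do not overflow, so this sketch sidesteps the numerical range issues that a double-precision GPU implementation has to consider.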

2.3 Parallelization over two dimensions: s and j

The meaning of parallelizing over two dimensions is, in this case, to fix one of the variables and check whether the computation can be run in parallel over the two other variables given the fixed one. In practice, the fixed variable is the outer-most for-loop, and within each iteration of this loop, everything is calculated in parallel over the two other variables. Out of the three variables, it is only necessary to find one such variable to fix, and instead of exhaustively checking all possibilities, one can check recursion 8 to see which variable qualifies. By comparing the left side to the right side, the only variable whose current value is never read on the right side is $i$, i.e. $N_i$ depends only on $N_{i-1}$, and, furthermore, the other two variables are not possible to fix. Below is a verification that $N_i(s, j)$ is parallelizable over $s$ and $j$ given that $i$ is fixed.

Initiation: In lines 6–9 in Supplementary Algorithm S1, constants are set, and both arrays, $N_{\mathrm{old}}$ and $N_{\mathrm{new}}$, are initialized. There is no conflict for the parallelization of this step.

Maintenance: Consider iteration $i$. In lines 13–16, the array $N_{\mathrm{new}}$ is only dependent on constants (i.e. the boundary conditions 7 and 6). Thus, the computation of $N_{\mathrm{new}}$ is parallelizable. Furthermore, in lines 17–20, $N_{\mathrm{new}}$ is only dependent on elements from $N_{\mathrm{old}}$ and on $z_i$, which are both invariant within iteration $i$ (except that $N_{\mathrm{old}}$ and $N_{\mathrm{new}}$ switch values at the end of iteration $i$, when all computations for $i$ are already done). Therefore, $N_{\mathrm{new}}$ is parallelizable here too. Finally, at line 24, $N_{\mathrm{new}}$ is switched with $N_{\mathrm{old}}$, and no parallelization occurs here. Hence, the algorithm is parallelizable for iteration $i$. When entering the next iteration, i.e. $i + 1$, the same arguments apply.

Termination: When arriving at line 10 in Supplementary Algorithm S1 with $i > m + n$, the algorithm will not enter the loop. By the maintenance argument above, one can be sure that the computation of $N(s, m)$ is correct. Since there is no computation after the for-loop block, there are no further modifications of the resulting array, and it is safe to return it.
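The independence argument can be made concrete with a small CPU-only Python sketch. It does not use a GPU; instead, the entries (j, s) of the new array are filled in a randomly shuffled order within each iteration of i, which stands in for an arbitrary thread schedule. All names (`update_entry`, `green_counts_any_order`, `n_old`, `n_new`) are hypothetical and chosen for this illustration only.

```python
import random


def update_entry(n_old, z_i, j, s):
    """New value at (j, s); reads only the previous array and z_i."""
    ways = n_old[j][s]
    if j >= 1 and s >= z_i:
        ways += n_old[j - 1][s - z_i]
    return ways


def green_counts_any_order(z, m, rng):
    """Count size-m subsets of z per sum, visiting (j, s) in random order.

    Assumes m >= 1 and non-negative integer scores.
    """
    s_max = sum(sorted(z)[-m:])
    n_old = [[0] * (s_max + 1) for _ in range(m + 1)]
    n_old[0][0] = 1
    n_new = [[0] * (s_max + 1) for _ in range(m + 1)]
    cells = [(j, s) for j in range(m + 1) for s in range(s_max + 1)]

    for z_i in z:               # sequential: iteration i needs the result of i - 1
        rng.shuffle(cells)      # stand-in for an arbitrary thread schedule
        for j, s in cells:      # on a GPU, each (j, s) could be one thread
            n_new[j][s] = update_entry(n_old, z_i, j, s)
        n_old, n_new = n_new, n_old   # swap only after all cells are written
    return n_old[m]


z = [3, 1, 4, 1, 5, 9, 2]
first = green_counts_any_order(z, 3, random.Random(0))
second = green_counts_any_order(z, 3, random.Random(1))
print(first == second)
```

Because no entry of the new array is read while it is being written, any visiting order yields the same counts, which is the property that allows one thread per (j, s) entry.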

2.4 Discretization of real-valued samples into integer-valued samples

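Our test is defined for integer-valued distributions of the scores, $x_{A,i}, x_{B,i} \in \mathbb{N}_0$. However, we may use the procedure approximately if we first discretize any real-valued scores, $\tilde{\mathbf{x}}_A$ and $\tilde{\mathbf{x}}_B$, where $\tilde{x}_{A,i}, \tilde{x}_{B,i} \in \mathbb{R}$. This was achieved by partitioning the samples’ range into $n_w$ discretization windows, each of length $\Delta = (\tilde{z}_{\max} - \tilde{z}_{\min}) / n_w$, where $\tilde{z}_{\min}$ and $\tilde{z}_{\max}$ are the smallest and largest observed scores. Each of these windows covers $[\tilde{z}_{\min} + k \Delta, \tilde{z}_{\min} + (k + 1) \Delta)$, for $k = 0, \ldots, n_w - 1$, which enables us to map any continuous sample into discrete scores, as $x = \lfloor (\tilde{x} - \tilde{z}_{\min}) / \Delta \rfloor$, with values in the interval $[0, n_w - 1]$ (the largest score is assigned to the last window). Unfortunately, this comes at the cost of discretization errors, which will be a function of the number of discretization windows, $n_w$. We will investigate the effects of such discretization in Section 4.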
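A possible way to carry out such a discretization is sketched below. The details are assumptions made for illustration: the function name `discretize` is hypothetical, the range is taken over both samples jointly, scores are mapped by flooring, and the largest score is clamped into the last window. The actual package may handle these details differently.

```python
def discretize(x_a, x_b, n_w):
    """Map real-valued scores to integer window indices 0 .. n_w - 1.

    Assumption: windows of equal length span the joint range of both samples.
    """
    lo = min(min(x_a), min(x_b))
    hi = max(max(x_a), max(x_b))
    width = (hi - lo) / n_w            # length of one discretization window
    if width == 0:                     # all scores identical
        return [0] * len(x_a), [0] * len(x_b)

    def to_bin(x):
        # The maximum lies on the upper edge and is put in the last window.
        return min(int((x - lo) / width), n_w - 1)

    return [to_bin(x) for x in x_a], [to_bin(x) for x in x_b]


d_a, d_b = discretize([0.0, 0.26, 1.0], [0.5, 0.74], n_w=4)
print(d_a, d_b)  # [0, 1, 3] [2, 2]
```

The resulting integer scores can then be passed to any integer-valued exact test; a larger `n_w` reduces the discretization error at the price of a larger range of possible sums.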

3 System and methods

3.1 Compared methods

Several methods were used for comparison with our implementation. We ran t tests through the function scipy.stats.ttest_ind, and Mann-Whitney U tests through scipy.stats.mannwhitneyu, both functions in scipy version 1.4.1. We downloaded the FastPerm method from the master branch of https://github.com/bdsegal/fastPerm. When executing FastPerm we applied the default parameter-tuning as described in the implementation guide, available at the method’s GitHub repository. We included the shift method (exact test) of the R package Coin (https://CRAN.R-project.org/package=coin) as part of our Python package and used it as a benchmark.

3.2 System

Performance figures were recorded on an 8-core Intel i7-9700K with an NVIDIA GeForce RTX 2070 graphics card. In some of the experiments, we compared this configuration’s performance to that of similar computers, one equipped with an NVIDIA GeForce RTX 2060 and one with an NVIDIA Titan X Pascal graphics card.

3.3 Implementation

A Python 3.6 implementation of the algorithms described above was made available under an Apache 2.0 license. We implemented our algorithm together with our discretization strategy as a CUDA (Chakrabarti et al., 2012) enabled Python module. The implementation and all results of this paper are available in reproducible form from a GitHub repository https://github.com/statisticalbiotechnology/parallelPermutationTest.

4 Results

We implemented the algorithm described above, and set out to characterize the algorithm’s performance. To establish that the strategy executes in a practically useful time scale, we first tested the running time performance as a function of the number of discretization windows and its dependence on sample size. Subsequently, we tested the accuracy of our discretization strategy to establish that it is asymptotically unbiased and precise compared to other methods. Finally, we applied the method to a relatively large-scale proteomics dataset to establish the method’s usefulness in a practical test scenario. For some of the tests, we were able to benchmark our GPU implementation of the Green algorithm (named Green Cuda in the following text) against other methods. Here we implemented a single-thread version of the Green algorithm (Green Singlethread) and a multi-thread version of the Green algorithm (Green Multithread), both on the CPU. We also used a previously described Monte Carlo-based sampling method called the fast permutation method, FastPerm (Segal et al., 2018), the shift algorithm’s implementation in the popular R package Coin (Hothorn et al., 2006; Hothorn et al., 2008) for exact permutation tests (here named Shift Coin), and Python scipy’s implementations of the t test and Mann-Whitney U test.

4.1 Test of running time requirements

Running time as a function of sample size, n

We first tested the running time requirements of the tested methods as a function of sample size. Here, we selected uniformly distributed samples, and , and drew samples of size . For each , we used five replicate samples. The average time to calculate the corresponding P values was recorded and plotted as a function of n (Fig. 1). For the tested sample sizes, Green Cuda scales well with sample size, and Green Cuda was found to be between 15 and 50 times faster than FastPerm, see Figure 1b at n = 300 and n = 100, respectively. However, the two methods seem to reach a similar running time around n = 500.

Fig. 1.

Running time requirements of the compared algorithms. We plotted the time required to calculate P values for samples from and for different sample sizes n. The mean time and the 95% confidence interval around the mean time to calculate each of 5 replicate samplings were plotted in (a) linear and (b) log scale. It should be noted that the execution times were highly reproducible, and it might be hard to see the confidence interval in the plots. (c) We further investigated the running time as a function of the number of Monte Carlo samples for an MC-based approach. We compared the running time for a Monte Carlo sampler drawing samples of , with five replicates. The horizontal lines represent the running time for Green Cuda to compute the same samples.

The Green Singlethread should have a running time that expands as , as $n_w$ was held constant in the experiment. This corresponds well with our observations in Figure 1. Some Monte Carlo-based approaches, unlike e.g. FastPerm, leave the choice of the number of sampled permutations to their users. The number of permutations is inversely proportional to the granularity of the P values estimated by the MC-sampler, and hence governs their possible precision, at least when no other techniques such as importance sampling are used. We hence measured the running time for different numbers of permutation samples with the MC-sampling functionality of the package Coin, putting it in the context of the running time of Green Cuda (Fig. 1c). The two methods seem to have similar running times for about $10^5$–$10^6$ permutation samples, tentatively suggesting that Green Cuda will be the faster of the two methods whenever the precision of the estimated P values is desired to be better than $10^{-5}$–$10^{-6}$. Furthermore, to see how the running time depends on the hardware, the test was repeated on other GPUs. However, we found a relatively small difference in performance between the tested graphics cards (see Supplementary Fig. S1).
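To make the link between the number of sampled permutations and the granularity of the estimated P values concrete, here is a minimal Monte Carlo sketch in Python. It is not the sampler of the Coin package or of FastPerm: the function name `mc_permutation_p_value` is hypothetical, the statistic is assumed to be the sum of the first sample, and the add-one estimator is one common choice that is assumed here for illustration.

```python
import random


def mc_permutation_p_value(x_a, x_b, n_perm, rng):
    """Monte Carlo estimate of P(S >= s) for the sum of the first sample."""
    z = list(x_a) + list(x_b)
    m = len(x_a)
    s_obs = sum(x_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(z)                  # random relabelling of the pooled scores
        if sum(z[:m]) >= s_obs:
            hits += 1
    # Add-one estimator: the estimate is a multiple of 1 / (n_perm + 1)
    # and can never be smaller than 1 / (n_perm + 1).
    return (hits + 1) / (n_perm + 1)


rng = random.Random(2021)
print(mc_permutation_p_value([5, 7, 9], [1, 2, 3, 4], n_perm=999, rng=rng))
```

With 999 permutations the estimate cannot fall below 0.001, so resolving P values near $10^{-6}$ requires on the order of $10^6$ permutations, whereas an exact method returns the full distribution in one pass.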

Running time as a function of the number of discretization windows, $n_w$

We also wanted to establish that our implementation scales well with the number of discretization windows used for the test. Again, we sampled from and with sample size , with five replicates. We plotted the running time as a function of $n_w$ in Figure 2.

Fig. 2.

Running time as a function of the number of discretization windows. We plotted the mean and standard deviation of the required running time, as wall time, as a function of the number of discretization windows, $n_w$. Time was plotted (a) in normal scale, and (b) in log scale. Note that the similarity in execution speed makes it hard to separate the series for Green Multithread and Coin Shift. Also note that the other methods were excluded from the plot, as the discretization step is exclusively present in the Green algorithm.

The single-threaded implementation of the Green algorithm (Green Singlethread) should theoretically have a running time complexity . As sample size n is kept constant in the experiment, we expect the running time to expand as . Indeed, Figure 2 confirms this for Green Singlethread.

4.2 Memory allocation

Test of memory allocation

We characterized the memory requirements of Green Cuda for increasing sample sizes n and m, and for a growing number of bins $n_w$. The first experiment (Fig. 3a) varied the set sizes , for three different bin sizes $n_w$ of 64, 128 and 256. In the second experiment (Fig. 3b), the number of bins was the variable for three different set sizes of 125, 250 and 500. In both experiments, the data were sampled from and and the number of replicates was 1000.

Fig. 3.

Memory allocation as a function of set size and bin size for Green Cuda. We plotted the required memory allocation to calculate P values for samples from and for different sample sizes n and bin sizes . In (a) the $n_w$ used were 64, 128 and 512 with 5 replicates, and in (b) n were 125, 250 and 500 with 5 replicates.

The amount of data that can be handled by a GPU at each point in time is limited by the GPU’s memory size. Here we used an NVIDIA GeForce RTX 2070, which allows for 7982 MiB memory allocation. One could imagine settings where is so large that one cannot calculate N even for one data point. However, for such cases we would run into floating-point problems before reaching this type of memory problem. For large set sizes, , the sums in the entries in N start to go beyond the maximum double-precision value in CUDA, i.e. approximately $1.8 \times 10^{308}$. A future improvement of our algorithm could be to reduce the values by dividing all elements in N by a normalizing factor in each iteration over i, which would help us reduce the problem of N becoming too large.
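The scale at which such counts leave the double-precision range can be explored with a few lines of Python. The sketch below uses the total number of subsets, the binomial coefficient of Equation 2, for two samples of equal size n. Since this total is the sum of all entries N(s, m), it only bounds the individual entries from above, so individual entries overflow at somewhat larger set sizes. The function name `first_overflowing_set_size` is hypothetical.

```python
import sys
from math import comb


def first_overflowing_set_size():
    """Smallest n for which C(2n, n) exceeds the largest finite double."""
    n = 1
    # Python compares exact integers with floats without overflow errors.
    while comb(2 * n, n) <= sys.float_info.max:
        n += 1
    return n


n_limit = first_overflowing_set_size()
print(n_limit, sys.float_info.max)
```

Under these assumptions the loop stops at n = 515, i.e. with a little over 500 observations per sample the total number of subsets no longer fits in a double, which is in line with the normalization idea mentioned above.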

4.3 Test of accuracy and precision

Test dependency of window size

While the Green algorithm is a non-parametric test for discrete test statistics, the algorithm’s performance will be a function of the number of windows, $n_w$, we use for discretizing any continuous data. We hence wanted to characterize the influence of $n_w$ on the accuracy of our test. We selected samples from a Normal distribution and compared the computed P values with the ones from a regular t test (Supplementary Fig. S2). The results suggest that both the accuracy and precision of the test improve when increasing the number of windows. However, the effect seems to saturate for .

Comparison of accuracy and precision against other methods

We subsequently wanted to test the accuracy of the estimated P values. Again we used normally distributed samples and used t test-calculated P values as reference. As benchmark comparisons, we again used the FastPerm method and Python’s asymptotic Normal distribution implementation of the Mann-Whitney U test, scipy.stats.mannwhitneyu. We plotted the ratio $p / p_t$, where $p$ is the tested P value and $p_t$ is given by a t test, as a function of sample size for two different effect sizes (Fig. 4). Overall, the ratios of Green Cuda were found closer to 1, and less dependent on the sample size, than the ones from the compared methods. The reason for the Mann-Whitney U test deviating from the results of the t test, particularly in Figure 4b, is that the test has comparatively low efficiency when testing on normally distributed data. It is also important to note that the Mann-Whitney U test, which depends on ranking statistics, would not be under the same null model as the compared methods in the presence of ties. However, we expect ties to be rare when sampling from continuous distributions.

Fig. 4.

Comparison of estimation error as a function of sample size. We plotted the fold change between, on one side, Green Cuda, FastPerm and the Mann-Whitney U test and, on the other side, a t test, as a function of the sample size, n, when and (a) , and (b) . For both cases we plotted results from 50 samples, and used $n_w$ = 100 discretization windows. Note that the estimation errors for Green Singlethread, Green Multithread, Green Cuda and Coin Shift are identical, so we compressed the results into one series, labeled Green.

Calibration test

We also tested the uniformity of the parallelized shift method’s P values under the null hypothesis (Murdoch et al., 2008). This property is of particular importance for studies where we test many variables for the same sample, as it is the base for efficient multiple hypothesis testing (Efron, 2012; Storey and Tibshirani, 2003). Here, we compared Green Cuda against the FastPerm method, the Mann-Whitney U test and a regular t test. We picked 10000 samples from the Normal distribution and a log-Normal distribution and plotted each method’s estimated P values as a function of their relative rank (Supplementary Fig. S3). We found that the calibration of the parallel Green method was on par with the t test. However, unsurprisingly, the calibration seems to be entirely off for the t tests on log-Normal distributed data. We see that the Mann-Whitney U test’s calibration appears conservative, while the FastPerm method appears anti-conservative for both tested distributions.

4.4 Running time requirements for a proteomics dataset

As the last test, we tested the algorithm’s performance on a dataset of breast cancer samples from the CPTAC consortium (NCI CPTAC, 2016). For the 8051 proteins for which measurements had been obtained for all samples, we tested differential abundance between 80 non-triple-negative and 26 triple-negative samples. The run time for the Green Cuda method can be found in Table 1. For the other methods, FastPerm took 45 min, the Mann-Whitney U test 1.5 s and the t test 1.32 s.

Table 1.

Running time for Green Cuda on the proteomics dataset as a function of the number of discretization windows n_w

n_w	16	64	256	512	1028
Time (s)	6.31	6.87	10.8	17.5	37

5 Discussion

Statistical testing is the base for most scientific activities. Also, in most research areas, the amount of public data is rapidly increasing, and hence there is a need for ever more efficient methods to compute significance. Permutation tests offer an exciting method as they do not assume a particular sampling distribution, but instead build one by permuting label associations to the observed data. This approach corresponds perfectly with the null hypothesis that there is no difference in case and control outcomes. Here, we have described a parallelized dynamic programming method to perform permutation tests. We have demonstrated that it is faster and more accurate than the sampling-based methods. Previous work by Pagano and Tritchler (1983) demonstrates that one can quickly expand exact tests to handle missing values, something that rank-based tests cannot easily handle. We note that several studies are dependent on normal approximations of non-parametric tests such as the Mann-Whitney U test. In practice, the implementations of such tests are approximations, as they are asymptotic and not exact. The Green Cuda method offers an exact test that does not appear to be much slower, yet is more accurate than such tests. However, admittedly, the difference between the outcomes of asymptotic and permutation tests gets smaller with an increased sample size. Permutation tests have been successfully used in many, if not most, areas of bioinformatics as a relatively assumption-free method for assessing statistical significance in inferences. In most applications the procedures involve some flavor of Monte Carlo-based sampling methodology. Here we demonstrated that, at least in the case of statistical hypothesis testing, one can instead rely on calculating a full distribution of the sampling space.

12 in total

1. PLINK: a tool set for whole-genome association and population-based linkage analyses.

Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham
Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025

2. Permutation tests for multiple loci affecting a quantitative character.

Authors: R W Doerge; G A Churchill
Journal: Genetics Date: 1996-01 Impact factor: 4.562

3. Accurate and fast multiple-testing correction in eQTL studies.

Authors: Jae Hoon Sul; Towfique Raj; Simone de Jong; Paul I W de Bakker; Soumya Raychaudhuri; Roel A Ophoff; Barbara E Stranger; Eleazar Eskin; Buhm Han
Journal: Am J Hum Genet Date: 2015-05-28 Impact factor: 11.025

4. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

5. Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules.

Authors: Michael P Fay; Michael A Proschan
Journal: Stat Surv Date: 2010

6. PRESTO: rapid calculation of order statistic distributions and multiple-testing adjusted P-values via permutation for one and two-stage genetic association studies.

Authors: Brian L Browning
Journal: BMC Bioinformatics Date: 2008-07-13 Impact factor: 3.169

7. A simple null model for inferences from network enrichment analysis.

Authors: Gustavo S Jeuken; Lukas Käll
Journal: PLoS One Date: 2018-11-09 Impact factor: 3.240

8. Fast approximation of small p-values in permutation tests by partitioning the permutations.

Authors: Brian D Segal; Thomas Braun; Michael R Elliott; Hui Jiang
Journal: Biometrics Date: 2017-05-18 Impact factor: 2.571

9. Estimation of significance thresholds for genomewide association scans.

Authors: Frank Dudbridge; Arief Gusnanto
Journal: Genet Epidemiol Date: 2008-04 Impact factor: 2.135

10. Proteogenomics connects somatic mutations to signalling in breast cancer.

Authors: Philipp Mertins; D R Mani; Kelly V Ruggles; Michael A Gillette; Karl R Clauser; Pei Wang; Xianlong Wang; Jana W Qiao; Song Cao; Francesca Petralia; Emily Kawaler; Filip Mundt; Karsten Krug; Zhidong Tu; Jonathan T Lei; Michael L Gatza; Matthew Wilkerson; Charles M Perou; Venkata Yellapantula; Kuan-lin Huang; Chenwei Lin; Michael D McLellan; Ping Yan; Sherri R Davies; R Reid Townsend; Steven J Skates; Jing Wang; Bing Zhang; Christopher R Kinsinger; Mehdi Mesri; Henry Rodriguez; Li Ding; Amanda G Paulovich; David Fenyö; Matthew J Ellis; Steven A Carr
Journal: Nature Date: 2016-05-25 Impact factor: 49.962