François Van Lishout, Francesco Gadaleta, Jason H Moore, Louis Wehenkel, Kristel Van Steen.
Abstract
BACKGROUND: The purpose of the MaxT algorithm is to provide a significance test that controls the family-wise error rate (FWER) during simultaneous hypothesis testing. However, the computing time and memory required by this procedure are proportional to the number of investigated hypotheses. The memory issue was solved in 2013 by Van Lishout's implementation of MaxT, which makes the memory usage independent of the size of the dataset. This algorithm is implemented in MBMDR-3.0.3, software that effectively identifies genetic interactions for a variety of SNP-SNP based epistasis models. That implementation nevertheless turned out to be less suitable for genome-wide interaction analysis studies because of its prohibitive computational burden.
Keywords: 3-order interactions; Algorithmic; Gamma distribution; Genome-wide interaction studies; MaxT; Multiple testing; SNP-environment interactions
Year: 2015 PMID: 26594243 PMCID: PMC4654922 DOI: 10.1186/s13040-015-0069-x
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 Classical MaxT versus Van Lishout’s implementation of MaxT. In the classical implementation of MaxT, all $T_{i,j}$ values are computed and stored in memory, $\forall i=0\dots B$, $\forall j=1\dots m$. Then, $T_{i,j}$ is overwritten by $T_{i,j+1}$ whenever $T_{i,j+1}>T_{i,j}$, $\forall i=1\dots B$, $\forall j=m-1\dots 1$. Finally, $p_{j+1}$ is overwritten by $p_j$ whenever $p_j>p_{j+1}$, $\forall j=1\dots m-1$. In Van Lishout’s implementation of MaxT, the $[T_{i,n+1},\dots,T_{i,m}]$ values are computed as before, but only the maximum values $M_i$ are stored in memory (for $i>0$)
Van Lishout’s MaxT
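| (1) Compute the test-statistics for all pairs, but only store the $n$ highest ones in a vector where $T_{0,1} \ge \dots \ge T_{0,n}$. |
| (2) Initialise a vector $p = (p_1,\dots,p_n)$ with 1's. |
| (3) Perform the following operations for $i = 1,\dots,B$: |
| (a) Generate a random permutation of the trait column. |
| (b) Compute $T_{i,1},\dots,T_{i,n}$. |
| (c) Compute the maximum $M_i$ of the test-statistic values $T_{i,n+1},\dots,T_{i,m}$ of the remaining pairs. |
| (d) Replace $T_{i,n}$ by $M_i$ if $T_{i,n} < M_i$. |
| (e) Force the monotonicity of the $T_{i,j}$ values: for $j = n-1,\dots,1$, replace $T_{i,j}$ by $T_{i,j+1}$ if $T_{i,j} < T_{i,j+1}$. |
| (f) For each $j = 1,\dots,n$, if $T_{i,j} \ge T_{0,j}$, increment $p_j$ by 1. |
| (4) Divide all values of vector $p$ by $B+1$. Force the monotonicity of the p-values: for $j = 1,\dots,n-1$, replace $p_{j+1}$ by $p_j$ if $p_{j+1} < p_j$. |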
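The steps above can be sketched in Python as follows. This is a minimal illustration, not the MBMDR code: the function names `maxt_top_n` and `make_stat`, the callback `stat(j, trait)` and the toy covariance statistic are hypothetical, and the indexing assumes the usual step-down MaxT convention (keep the n highest observed statistics, count permutations that reach them, divide by B+1).

```python
import random


def maxt_top_n(stat, trait, n_tests, n_top, n_perm, seed=0):
    """Step-down MaxT that keeps only the n_top highest statistics in memory.

    stat(j, trait) is a hypothetical callback returning the test statistic
    of hypothesis j for the given ordering of the trait column.
    """
    rng = random.Random(seed)

    # (1) Score every hypothesis; keep the n_top highest, sorted descending.
    scored = sorted(((stat(j, trait), j) for j in range(n_tests)), reverse=True)
    top = scored[:n_top]
    top_ids = {j for _, j in top}

    # (2) Counters start at 1, so the final p-value is (exceedances + 1) / (B + 1).
    counts = [1] * n_top

    for _ in range(n_perm):
        # (3a) Permute the trait column.
        permuted = trait[:]
        rng.shuffle(permuted)
        # (3b) Statistics of the stored hypotheses on the permuted trait.
        t = [stat(j, permuted) for _, j in top]
        # (3c) Maximum over the hypotheses that are not stored.
        rest = [stat(j, permuted) for j in range(n_tests) if j not in top_ids]
        # (3d) Fold that maximum into the last stored position.
        if rest:
            t[-1] = max(t[-1], max(rest))
        # (3e) Successive maxima, from the last position up to the first.
        for j in range(n_top - 2, -1, -1):
            t[j] = max(t[j], t[j + 1])
        # (3f) Count the permutations that reach the observed statistic.
        for j in range(n_top):
            if t[j] >= top[j][0]:
                counts[j] += 1

    # (4) Turn counts into p-values and make them non-decreasing.
    p = [c / (n_perm + 1) for c in counts]
    for j in range(1, n_top):
        p[j] = max(p[j], p[j - 1])
    return [(j, s, pj) for (s, j), pj in zip(top, p)]


def make_stat(columns):
    """Toy statistic: absolute centred cross-product of a column and the trait."""
    def stat(j, trait):
        x = columns[j]
        mx = sum(x) / len(x)
        my = sum(trait) / len(trait)
        return abs(sum((a - mx) * (b - my) for a, b in zip(x, trait)))
    return stat


# Toy data: column 0 is identical to the trait, the others are weakly related.
columns = [
    [0, 1] * 10,
    [0, 0, 1, 1] * 5,
    [1] * 10 + [0] * 10,
    [0, 1, 1, 0] * 5,
    [1, 1, 1, 0] * 5,
]
trait = [0, 1] * 10
result = maxt_top_n(make_stat(columns), trait, n_tests=5, n_top=3, n_perm=200, seed=1)
for j, s, pj in result:
    print(j, s, round(pj, 4))
```

Memory use depends on `n_top` and not on `n_tests`, because the statistics outside the stored set are reduced to a single maximum per permutation. In this sketch the running time still grows with `n_tests`, since every hypothesis is re-scored for every permutation.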
Analysis of the computing times of the different steps of Van Lishout’s implementation of MaxT on a dataset containing 1 million SNPs
| Step | Theoretical value | Numerical value |
|---|---|---|
| Step 1 | | |
| Step 2 | | |
| Step 3 (a) | | |
| Step 3 (b) | | |
| Step 3 (c) | | |
| Step 3 (d) | | |
| Step 3 (e) | | |
| Step 3 (f) | | |
| Step 4 | | |
Mean and variance of the fitted parameters for datasets D1 to D4
| Parameter | D1 mean | D1 var | D2 mean | D2 var | D3 mean | D3 var | D4 mean | D4 var |
|---|---|---|---|---|---|---|---|---|
| | 0.337 | $1.247\times10^{-6}$ | 0.335 | $3.815\times10^{-6}$ | 0.137 | $4.948\times10^{-7}$ | 0.366 | $9.356\times10^{-7}$ |
| | 7.742 | $5.566\times10^{-4}$ | 7.825 | $8.778\times10^{-4}$ | 6.189 | $6.472\times10^{-4}$ | 7.788 | $3.805\times10^{-4}$ |
| | 1.017 | $2.612\times10^{-4}$ | 1.012 | $2.534\times10^{-4}$ | 0.990 | $3.580\times10^{-4}$ | 1.017 | $1.725\times10^{-4}$ |
| | 1.917 | $1.462\times10^{-3}$ | 1.974 | $1.532\times10^{-3}$ | 1.694 | $1.829\times10^{-3}$ | 1.917 | $9.695\times10^{-4}$ |
Step 3(c) of gammaMAXT
| (1) If (i modulo 20 = 1) estimate |
| (a) Set |
| (b) Randomly select integer |
| (c) If |
| (d) Repeat steps (b) and (c) until |
| (e) Sort |
| (f) Estimate |
| (g) Estimate |
| (h) Estimate |
| (i) Estimate |
| (2) If (i modulo 20 ≠ 1), use the latest estimated values of |
| (3) Sample |
Sample M when CDF is
| (a) Take a too high initial guess of |
| (b) Randomly select a real number |
| (c) If |
| (d) Repeat step (c) until |
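Drawing a value from a distribution that is known only through its CDF can be sketched as below. The sketch draws a uniform number u, starts from a bracket whose upper end is deliberately too high, and narrows it until CDF(x) matches u. Bisection is used here as a generic stand-in and is not claimed to be the update rule of gammaMAXT. The names `sample_by_inversion` and `toy_cdf` are hypothetical, and the toy CDF (the maximum of 100 independent unit-rate exponential variables) is an assumption chosen only because its quantiles have a closed form.

```python
import math
import random


def sample_by_inversion(cdf, upper, rng, n_iter=200):
    """Return x >= 0 such that cdf(x) is approximately u, with u ~ Uniform(0, 1).

    `upper` should be a guess that is too high, i.e. cdf(upper) close to 1.
    """
    u = rng.random()
    lo, hi = 0.0, upper
    # Widen the bracket in the rare case where the guess was not high enough.
    while cdf(hi) < u:
        hi *= 2.0
    # Bisection: keep the half of the bracket that still contains the solution.
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def toy_cdf(x, n_vars=100):
    """CDF of the maximum of n_vars independent unit-rate exponentials (toy)."""
    return (1.0 - math.exp(-x)) ** n_vars


rng = random.Random(42)
draws = [sample_by_inversion(toy_cdf, 50.0, rng) for _ in range(2000)]
print(min(draws), max(draws))
```

For this toy CDF the median is available in closed form, $-\ln(1 - 0.5^{1/100})$, so the empirical median of `draws` gives a quick sanity check of the sampler.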
Fig. 2 MBMDR-4.2.2 parallel workflow. First, each cluster node computes a fair share of the $T_{0,1},\dots,T_{0,m}$ values from Fig. 1 and saves the n highest into file top_c.txt. Second, a node aggregates all top_c.txt files and retrieves the overall n highest values, which it saves in topfile.txt. Third, each cluster node reads topfile.txt and performs an equal share of the B permutations of Fig. 1, saving its results into file permut_c.txt. Finally, a cluster node aggregates all permut_c.txt files and produces the final output file
gammaMAXT parallel workflow
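| (1) Each cluster node $c$ computes a fair proportion of the $T_{0,1},\dots,T_{0,m}$ values from Fig. 1 and saves the $n$ highest into file top_c.txt. |
| (2) Upon termination of all computations at the previous step, a cluster node aggregates all top_c.txt files and retrieves the overall $n$ highest values, saved in topfile.txt. |
| (3) Each cluster node reads topfile.txt and performs a fair proportion of the $B$ permutations: |
| (a) Generate a random permutation of the trait column. |
| (b) Compute $T_{i,1},\dots,T_{i,n}$. |
| (c) Execute step (3)(c) of the gammaMAXT algorithm to estimate $M_i$. |
| (d) Replace $T_{i,n}$ by $M_i$ if $T_{i,n} < M_i$. |
| (e) Force the monotonicity of the $T_{i,j}$ values: for $j = n-1,\dots,1$, replace $T_{i,j}$ by $T_{i,j+1}$ if $T_{i,j} < T_{i,j+1}$. |
| (f) For each $j = 1,\dots,n$, if $T_{i,j} \ge T_{0,j}$, increment $p_j$ by 1. |
| Upon completion of all computations on node $c$, the vector $p$ is saved into file permut_c.txt. |
| (4) A cluster node sums all vectors from the permut_c.txt files. Each value of the resulting vector is incremented by 1 and divided by $B+1$. Finally, for $j = 1,\dots,n-1$, $p_{j+1}$ is replaced by $p_j$ if $p_{j+1} < p_j$. |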
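The two aggregation steps of this workflow (merging the per-node top lists, then combining the per-node permutation counts) can be sketched as follows. In-memory lists stand in for the contents of the top_c.txt and permut_c.txt files. The function names, the pair labels and the numbers are hypothetical, and the final monotonicity pass is an assumption based on the standard step-down MaxT procedure.

```python
import heapq


def merge_top(per_node_tops, n):
    """Keep the n highest (statistic, label) entries across all nodes."""
    return heapq.nlargest(n, (entry for top in per_node_tops for entry in top))


def combine_counts(per_node_counts, n_perm_total):
    """Sum per-node exceedance counts and convert them into adjusted p-values."""
    totals = [sum(column) for column in zip(*per_node_counts)]
    # Each summed count is incremented by 1 and divided by B + 1.
    p = [(c + 1) / (n_perm_total + 1) for c in totals]
    # Enforce non-decreasing p-values along the ranked list.
    for j in range(1, len(p)):
        p[j] = max(p[j], p[j - 1])
    return p


# Hypothetical content of two top_c.txt files.
node_tops = [
    [(9.1, "snp1-snp2"), (3.0, "snp3-snp4")],
    [(7.5, "snp5-snp6"), (8.2, "snp1-snp7")],
]
overall_top = merge_top(node_tops, n=3)

# Hypothetical content of two permut_c.txt files (50 and 49 permutations).
node_counts = [[0, 2, 1], [1, 3, 0]]
p_values = combine_counts(node_counts, n_perm_total=99)
print(overall_top)
print(p_values)
```

Because each node only contributes a count vector of length n, the final aggregation is cheap compared with the permutations themselves.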
Two-locus penetrance table used to create the simulated datasets D1, D2 and D3
| | b/b | b/B | B/B |
|---|---|---|---|
| a/a | 0 | 0.1 | 0 |
| a/A | 0.1 | 0 | 0.1 |
| A/A | 0 | 0.1 | 0 |
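A short calculation shows what this penetrance table implies for each locus taken alone. The sketch below computes marginal penetrances under two assumptions that are not stated in the table: Hardy-Weinberg proportions and an allele frequency of 0.5 at both loci. The function names are hypothetical.

```python
# Rows: a/a, a/A, A/A; columns: b/b, b/B, B/B (values from the penetrance table).
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]


def genotype_freqs(allele_freq):
    """Hardy-Weinberg genotype frequencies for a biallelic locus (assumption)."""
    q = 1.0 - allele_freq
    return [q * q, 2.0 * allele_freq * q, allele_freq * allele_freq]


def marginal_penetrance(table, freqs_other_locus):
    """Average each row of the table over the genotypes of the other locus."""
    return [sum(p * f for p, f in zip(row, freqs_other_locus)) for row in table]


freqs = genotype_freqs(0.5)  # assumed allele frequency, not taken from the source
marginal_a = marginal_penetrance(PENETRANCE, freqs)
transposed = [list(col) for col in zip(*PENETRANCE)]
marginal_b = marginal_penetrance(transposed, freqs)
print(marginal_a, marginal_b)
```

Under these assumptions every genotype of either locus has the same marginal penetrance of 0.05, so neither SNP shows a main effect and the signal is only visible when both loci are considered jointly. With other allele frequencies the marginals would no longer be equal.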
Fig. 3 Theoretical (green) versus predicted M values for D1. 10 % is the optimal choice, leading to the lowest Kolmogorov-Smirnov distance
Execution times of MBMDR-4.2.2. The parallel workflow was tested on a 256-core computer cluster (Intel L5420 2.5 GHz). The sequential executions were performed on a single core of this cluster
| SNPs | Binary trait, sequential execution | Binary trait, parallel workflow | Continuous trait, sequential execution | Continuous trait, parallel workflow |
|---|---|---|---|---|
| $10^3$ | 13 min 33 sec | 20 sec | 13 min 18 sec | 18 sec |
| $10^4$ | 52 min 15 sec | 1 min 05 sec | 56 min 14 sec | 53 sec |
| $10^5$ | 64 h 35 min | 22 min 15 sec | 70 h 03 min | 20 min 28 sec |
| $10^6$ | ≈ 270 days | 25 h 12 min | ≈ 290 days | 24 h 06 min |
The results prefixed by the symbol “≈” are extrapolated.
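The speed-up implied by these timings can be worked out directly. The sketch below converts the binary-trait columns to seconds and divides sequential by parallel time. The helper `seconds` and the dictionary layout are illustrative choices, and the $10^6$ row relies on the extrapolated sequential estimate.

```python
def seconds(days=0, hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + secs


# (sequential, parallel) execution times for the binary trait, from the table.
BINARY = {
    "10^3": (seconds(minutes=13, secs=33), seconds(secs=20)),
    "10^4": (seconds(minutes=52, secs=15), seconds(minutes=1, secs=5)),
    "10^5": (seconds(hours=64, minutes=35), seconds(minutes=22, secs=15)),
    "10^6": (seconds(days=270), seconds(hours=25, minutes=12)),  # extrapolated
}

speedup = {snps: seq / par for snps, (seq, par) in BINARY.items()}
# Fraction of the ideal 256-fold speed-up on the 256-core cluster.
efficiency = {snps: s / 256 for snps, s in speedup.items()}

for snps in BINARY:
    print(snps, round(speedup[snps], 1), round(efficiency[snps], 2))
```

The speed-up grows with the number of SNPs (roughly 41, 48, 174 and 257), which suggests that fixed overheads dominate on small datasets. The last figure slightly exceeds the 256 available cores only because the sequential time is an approximate extrapolation.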
Observed FWER of MBMDR-4.2.2
| Set | Number of datasets | Observed FWER |
|---|---|---|
| | 1000 | 4.5 % |
| | 1000 | 6.2 % |
| | 200 | 7 % |
| | 200 | 6.5 % |
Power comparison between the gammaMAXT and the MaxT algorithms
| Heritability | gammaMAXT | MaxT |
|---|---|---|
| 0.0100 | 3.7 % | 4.2 % |
| 0.0125 | 17.9 % | 19.4 % |
| 0.0150 | 50.3 % | 51.5 % |
| 0.0175 | 67.0 % | 68.7 % |
| 0.0200 | 86.6 % | 87.9 % |
| 0.0225 | 94.3 % | 94.7 % |
| 0.0250 | 97.5 % | 97.8 % |
| 0.0275 | 99.2 % | 99.3 % |
| 0.0300 | 99.6 % | 99.6 % |
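The size of the power difference between the two algorithms can be read off the table with a few lines of arithmetic. The sketch below only restates the tabulated percentages, and the variable names are illustrative.

```python
# (heritability, gammaMAXT power in %, MaxT power in %) from the table above.
POWER = [
    (0.0100, 3.7, 4.2),
    (0.0125, 17.9, 19.4),
    (0.0150, 50.3, 51.5),
    (0.0175, 67.0, 68.7),
    (0.0200, 86.6, 87.9),
    (0.0225, 94.3, 94.7),
    (0.0250, 97.5, 97.8),
    (0.0275, 99.2, 99.3),
    (0.0300, 99.6, 99.6),
]

# Power lost by gammaMAXT relative to MaxT, in percentage points.
gaps = [(h, round(maxt - gamma, 1)) for h, gamma, maxt in POWER]
largest = max(gaps, key=lambda item: item[1])
mean_gap = sum(g for _, g in gaps) / len(gaps)

print(gaps)
print(largest, round(mean_gap, 2))
```

In this table gammaMAXT never exceeds MaxT in power. The largest gap is 1.7 percentage points, at heritability 0.0175, and the gap shrinks to zero as power approaches 100 %.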