Xia Cao1, Jie Liu1, Maozu Guo2,3, Jun Wang4. 1. College of Computer and Information Science, Southwest University, Beibei, Chongqing, 400715, China. 2. School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, 100044, China. 3. Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, 100044, China. 4. College of Computer and Information Science, Southwest University, Beibei, Chongqing, 400715, China. kingjun@swu.edu.cn.
Abstract
BACKGROUND: Detecting single nucleotide polymorphism (SNP) interactions is an important and challenging task in genome-wide association studies (GWAS). Various efforts have been devoted to detect SNP interactions. However, the large volume of SNP datasets results in such a big number of high-order SNP combinations that restrict the power of detecting interactions. METHODS: In this paper, to combat with this challenge, we propose a two-stage approach (called HiSSI) to detect high-order SNP-SNP interactions. In the screening stage, HiSSI employs a statistically significant pattern that takes into account family wise error rate, to control false positives and to effectively screen two-locus combinations candidate set. In the searching stage, HiSSI applies two different search strategies (exhaustive search and heuristic search based on differential evolution along with χ2-test) on candidate pairwise SNP combinations to detect high-order SNP interactions. RESULTS: Extensive experiments on simulated datasets are conducted to evaluate HiSSI and recently proposed and related approaches on both two-locus and three-locus disease models. A real genome-wide dataset: breast cancer dataset collected from the Wellcome Trust Case Control Consortium (WTCCC) is also used to test HiSSI. CONCLUSIONS: Simulated experiments on both two-locus and three-locus disease models show that HiSSI is more powerful than other related approaches. Real experiment on breast cancer dataset, in which HiSSI detects some significantly two-locus and three-locus interactions associated with breast cancer, again corroborate the effectiveness of HiSSI in high-order SNP-SNP interaction identification.
BACKGROUND: Detecting single nucleotide polymorphism (SNP) interactions is an important and challenging task in genome-wide association studies (GWAS). Various efforts have been devoted to detect SNP interactions. However, the large volume of SNP datasets results in such a big number of high-order SNP combinations that restrict the power of detecting interactions. METHODS: In this paper, to combat with this challenge, we propose a two-stage approach (called HiSSI) to detect high-order SNP-SNP interactions. In the screening stage, HiSSI employs a statistically significant pattern that takes into account family wise error rate, to control false positives and to effectively screen two-locus combinations candidate set. In the searching stage, HiSSI applies two different search strategies (exhaustive search and heuristic search based on differential evolution along with χ2-test) on candidate pairwise SNP combinations to detect high-order SNP interactions. RESULTS: Extensive experiments on simulated datasets are conducted to evaluate HiSSI and recently proposed and related approaches on both two-locus and three-locus disease models. A real genome-wide dataset: breast cancer dataset collected from the Wellcome Trust Case Control Consortium (WTCCC) is also used to test HiSSI. CONCLUSIONS: Simulated experiments on both two-locus and three-locus disease models show that HiSSI is more powerful than other related approaches. Real experiment on breast cancer dataset, in which HiSSI detects some significantly two-locus and three-locus interactions associated with breast cancer, again corroborate the effectiveness of HiSSI in high-order SNP-SNP interaction identification.
Entities:
Keywords:
Differential evolution; Family wise error rate; Genome-wide association studies; High-order SNP interactions; Statistically significant pattern
It has been widely recognized that single nucleotide polymorphisms (SNPs) are associated with a variety of human complex diseases. Genome-wide association study (GWAS) has became a powerful tool for detecting SNPs and detected hundreds of single SNPs associated with complex diseases [1]. However, these single SNPs can only explain a portion of the theoretical estimated heritability of complex diseases [2]. Complex diseases are influenced by various genetic variants and environmental factors. Therefore, SNP-SNP interactions defined as various joint effects of genetic variations should also be considered to better understand etiology of complex diseases.Existing approaches for searching two-locus SNP interactions can be grouped into three categories: exhaustive search, stochastic search and machine learning based search. Methods based on exhaustive search enumerate all possible SNP combinations of two-locus and perform interaction tests for each combination. Ritchie et al. [3] proposed the multifactor dimensionality reduction (MDR) approach, which partitions genotype combinations into two classes and exhaustively searches the best SNP combination by predicting the disease status. Stochastic methods use the random sampling procedures to search the space of SNP combinations. Zhang et al. [4] proposed a Bayesian epistasis association mapping approach, which iteratively uses the Markov chain Monte Carlo to search two-locus interactions. Machining learning methods [5], such as random forest, neural networks and support vector machines, also have been applied to discover SNP interactions. Bureau et al. [6] focused on measures of predictive importance and applied random forest to discover predictive polymorphisms or markers of a phenotype, which are likely to affect disease susceptibility.There are some challenges in detecting high-order SNP interactions. The first is the computational challenge. Although the overall complexity is linear with the number of individuals, it becomes exponential with the increase of locus. For example, for a dataset containing 1 million SNPs, the number of combinations to be tested is tremendous: 5 ×1011 pairwise interactions, 1.7 ×1017 3-way interactions, 8.3 ×1027 5-way interactions [7]. Therefore, exhaustively searching high-order epistatic interactions would be a heavy computational burden. The second is the statistical challenge. To balance the false-positive rate and false-negative rate, many stringent significance thresholds should be applied.Several high-order SNP interactions detection approaches were developed to attack the aforementioned challenges. Xie et al. [8] proposed EDCF (Epistasis Detector based on the Clustering of relatively Frequent items) to detect multi-locus epistatic interactions based on two-locus interaction models. EDCF is a two-stage method, it firstly groups all genotype combinations into three clusters and then evaluates the significance of interaction modules based on χ2-test. Guo et al. [9] proposed a two-stage method called DCHE (Dynamic Clustering for High-order genome-wide Epistatic interactions detecting). DCHE dynamically groups all genotype combinations into three to six subgroups, and then separately adopts χ2-test to evaluate the candidate pairwise combination in each subgroup. Yang et al. [10] proposed a stochastic search method (DECMDR). DECMDR combines the differential evolution algorithm [11] with a classification based multifactor-dimensionality reduction to detect the significant associations between cases and controls among all possible SNP combinations.These high-order SNP interactions detection approaches still have some limitations. Most of these approaches do not control false positives and apply Bonferroni correction [12] in multiple hypothesis test for GWAS. Bonferroni correction is simple, but it is often overly conservative when the number of SNP is very huge. The correction comes at the cost of increasing the probability of producing false negatives, i.e., reducing statistical power [13, 14].In this paper, we propose a two-stage approach named HiSSI to detect high-order SNP interactions based on candidate pairwise SNP combinations. In the screening stage, a statistically significant pattern considering family wise error rate (FWER) is introduced to control false positives in multiple hypothesis test. HiSSI makes the statistically significant pattern faster and more memory-efficient via a fast Westfall-Young permutation testing [15], and obtains a corrected significant threshold to screen significant pairwise SNP combination candidates. In the search stage, HiSSI employs two different strategies to search high-order SNP interactions. For a small set, HiSSI uses the exhaustive search. For a large set, HiSSI employs a heuristic search technique named differential evolution (DE) algorithm [10, 11] along with χ2-test. We conduct simulation studies with various two-locus and three-locus disease models to comparatively study the power of HiSSI and that of state-of-the-art approaches, including EDCF [8], DCHE [9] and DECMDR [10]. The empirical study demonstrates that our proposed HiSSI is generally more powerful than these approaches. Further study on a real breast cancer (BC) dataset shows that HiSSI also detects some two-locus and three-locus combinations that are significantly associated with breast cancer. These experiments prove that HiSSI is capable to identify high-order interactions from genome-wide data.
Methods
Problem statement
Suppose a genotype dataset include N samples and M SNPs. We use y to denote the phenotype (including case and control), P(s(i,j)) to denote the pattern of pairwise SNP (i-th SNP and j-th SNP) combinations. Let N1 and N0 denote the number of affected samples (i.e., cases) and the number of controls.Suppose a SNP with a major allele A, and a minor allele a. Three genotypes of a SNP are the homozygous reference genotype (AA), the heterozygous genotype (Aa), and the homozygous variant genotype (aa). Generally, these three genotypes are encoded as {0, 1, 2}. In this paper, for k-th (k={0,1,2}) genotype of i-th (i={1,2,…,M}) SNP, we encode it as {0, 1} by the ratio of the number of case and the number of control, which can be calculated as:where N0 and N1 denote the number of k-th genotype of i-th SNP under control and case set, respectively; is a balance factor to control the influence of unbalanced GWAS datasets. If R>1, the genotype is encoded by 1; otherwise, encoded by 0. In this way, each SNP is encoded by {0, 1}. For each pairwise SNP combination P(s(i,j)), it is also encoded by {0, 1} instead of nine genotype combinations as follows:where S and S denote the i-th and j-th SNPs.In the screening stage, HiSSI attempts to find all significant candidate pairwise SNP combinations (snp,snp) such that P(s(i,j)) is statistically associated with the phenotype y after correction for multiple hypothesis testing. In the search stage, HiSSI tries to find out high-order SNP interactions based on candidate set. The whole procedure of HiSSI is illustrated in Fig. 1. The following two subsections elaborate on these two stages, respectively.
Fig. 1
Procedure overview of HiSSI
Procedure overview of HiSSI
Stage 1: screening pairwise SNP combinations
For each pairwise SNP combination P(s(i,j)), we can obtain the 2 ×2 contingency table for P(s(i,j)) and phenotype y as Table 1.
Table 1
Contingency table for two-locus combinations and phenotype
Variables
P(s(i,j))=1
P(s(i,j))=0
Total
y=case
a
N1−a
N1
y=control
x−a
N0−(x−a)
N0
Total
x
N−x
N
x is the number of samples whose P(s(i,j)) take value 1. a has the same interpretation as x but restricted to cases.
Contingency table for two-locus combinations and phenotypex is the number of samples whose P(s(i,j)) take value 1. a has the same interpretation as x but restricted to cases.HiSSI evaluates the association between the phenotype y and the variable P(s(i,j)) by χ2-test [16]. Suppose p is the corresponding p-value of the two-locus combination (snp,snp) derived from the contingency table. If p≤δ∗ (δ∗ is the corrected significant threshold), HiSSI deems the two-locus combination is significant and places it into candidate set.HiSSI utilizes the minimum attainable p-value and the set of testable SNP combinations at significance level δ to make the permutation-testing more fast and efficient. Since the minimum attainable p-value Ψ(x) is symmetric about N/2 [17], there are only +1 different values of Ψ(x) denoted as , which is a monotonically decreasing sequence. Σ(δ) is the testable region, one two-locus combination (snp,snp) is testable if and only if the marginal x∈Σ(δ). can be computed by starting from Σ(δ0)=[0,N] and iteratively shrinked to obtain Σ(δ) from Σ(δ).At initialization, HiSSI generates J phenotypes based on J permutations, initializes J different minimum p-values =1 (the maximum value a p-value can take) and initializes the corrected significance threshold as δ=δ1, δ1 is the largest value that Ψ(x) can take other than the trivial value δ0=1, which deems all SNP pairs that are testable and significant. Then, HiSSI computes the corresponding testability region Σ and . For each two-locus combination, HiSSI computes x( and check if the combination is testable or not in the current corrected significance level δ. If x∈Σ, the combination is testable and needs to be processed. In such case, HiSSI does not need to exhaustively analyze all two-locus combinations, and only needs to analyze these combinations whose marginal x is in testable region Σ. By updating the minimum p-values by J permutations, FWER can be obtained. If FWER (δ)>α, k needs to be increased so as to decrease FWER (δ) to control the false positives. The corrected significance threshold δ∗ can be calculated as follows:The above processes are summarized in Algorithm 1.Once the corrected significance threshold δ∗ is obtained, for each two-locus combination, HiSSI computes the marginal x and a, which is the number of under the cases, and then computes the corresponding p-value via χ2-test. If p≤δ∗, HiSSI deems the combination is significant and places it into the candidate set.
Stage 2: high-order SNP interactions detection
In the search stage, HiSSI provides two strategies (exhaustive search and DE-based search) to search high-order SNP interactions based on candidate set.
Exhaustive search for small candidate set
Exhaustive search is affordable when the candidate set is small and has a larger chance to detect high-order SNP interactions than heuristic search. HiSSI applies exhaustive search on a small candidate set. To exhaustively search K-SNP (K≥3) interactions, HiSSI combines all candidate SNP pairs to a set of K-SNP, and computes the corresponding p-value obtained by χ2-test of K-SNP. HiSSI reports these combinations whose p-values are smaller than the corrected significant thresholds δ∗ of K SNPs, obtained by the Algorithm 1.
Heuristic search for large candidate set
For a large candidate set, HiSSI employs a heuristic search approach based on differential evolution (DE) algorithm [11, 18–23] with χ2-test to identify high-order SNP interactions. DE is a powerful heuristic and parallel direct search approach with few control variables. Here, we take K=3 as an example to illustrate the process of detecting high-order interactions. The DE-based search strategy is presented as follows.Initialization: for the candidate set C obtained from the first stage, a target vector is employed to represent a combination of three SNPs from C and defined as: :where ps is the population size, i.e., the number of randomly generated target vectors; g means the g-th iteration. i is the i-th target vector in the population, f(j=1,2,3) represents one of the three SNPs in the i-th target vector in the g-th generation. At the initialization (g=0), f(j=1,2,3) are randomly generated as follows:where upper and lower are the upper and lower bounds of the indexes of the candidate set. rand([0,1)) randomly generates a uniformly distributed random value within the range [0,1).Mutation: in the mutation operation, each target vector generates a mutant vector:where r1, r2 and r3∈(1,2,…,ps) are the random indices of the population, and they are mutually different. X, X and X are the selected three target vectors. F∈[0,2] is a real and constant factor that controls the amplification of differential variation (X−X).Recombination: in the recombination operation, the mutant vector V and the current target vector X are incorporated to generate a trial vector:wherewhere randb(j) is the j-th evaluation of a uniform random number generator with an outcome in [0,1], CR∈[0,1] is the crossover constant. rnbr(i) is a randomly chosen index in (1,2,3), it ensures that U obtains at least one parameter from V.Boundary constraints [10]: a trail vector must be checked whether it is a feasible SNP combination (i.e., no parameters in the trial vector outside of the problem space), and can be adjusted as follows:Selection: the selection operation determines whether the target vector X is replaced by the trial vector U in the next generation or not. An evaluation function is used to evaluate the target and trial vectors. Here, HiSSI employs the chi-square test as the evaluation function. If the corresponding p-value of trial vector U obtained by chi-square tese yields a better value than the corresponding p-value of target vector X, namely p(U)
Through the above four iterative operations (step (2)-(5)), the value of the target vector can be improved by competing between target vectors and trial vectors. These four operations are repeated until the maximum number of generations (g) is reached, and the target vector with the best fitness value is the detected high-order SNP interaction.
FWER control
In GWAS, SNP interaction detection leads to a multiple hypothesis testing problem that generates lots of false positives. To alleviate this problem, Boferroni correction [12] and permutation-testing [24], are widely used for correcting the multiple testing problem. However, Bonferroni correction only works when the number of test patterns is known in advance and small [14]. HiSSI applies a fast permutation-testing method [15] to strictly control the family wise error rate (FWER), defined as the possibility of producing at least one false positive. In the permutation-testing, HiSSI generates a re-sampled dataset by randomly permuting the phenotype. Then, HiSSI computes the minimum p-value across all genotype combinations. Repeating the permutation for a sufficiently number (J) of times, it obtains J different minimum p-values . The FWER can be evaluated as:where 1[·] is an indicator function which takes value 1 if its argument is true and 0 otherwise; δ is the corrected significance threshold.FWER control requires FWER ≤α with α being the desired significant threshold. By doing this, the corrected significant threshold δ is chosen appropriately. The optimal δ∗ is obtained by solving the same optimization problem as Equation (3). In addition, the optimization problem also can yield the highest power (the probability of detecting true positives), and strictly control the FWER.
Results
In this section, we evaluate the performance of HiSSI on both simulated and real datasets. In the simulated study, we compare HiSSI with EDCF [8], DCHE [9], DECMDR [10] and HiSSI-BC on different disease models (including two-locus and three-locus) with different parameters settings. HiSSI-BC is a variant of HiSSI, it obtains the corrected significant threshold using the Bonferroni correction. We adopt the same measure of power suggested by Wan et al. [25] as follows:where D′ is the number of datasets where exist true SNP interactions, and D is the number of all datasets. The definition of marginal effect size λ of a disease locus is the same as the one used in Zhang et al. [4]:where p and p denote the penetrance of genotype AA and Aa, respectively. For the real study, we apply HiSSI on the real breast cancer (BC) GWAS dataset collected from Wellcome Trust Case Control Consortium (WTCCC) [26].
Experiments on simulated datasets
To do comprehensive experimental comparison, we conduct simulation experiments on both two-locus and three-locus disease models. Since the number of candidate SNP combinations is small after screening in the first stage, we apply exhaustive search to detect high-order interaction.
Two-locus disease models
Three two-locus disease models (Model1, Model2 and Model3) are used to compare HiSSI with EDCF [8], DCHE [9], DECMDR [10] and HiSSI-BC. Model1 and Model2 are proposed by Marchini et al. [27], where Model1 with a threshold effect, and Model2 with a multiplicative effect. Model3 is proposed by Zhang et al. [4] with an additive effect. The marginal effect size is relatively small in the simulation study, λ=0.2 for Model1, Model2, and Model3. Minor allele frequencies (MAFs) are the same for both loci at three levels: MAF = 0.1, 0.2 and 0.4; and for Linkage disequilibrium (LD), r2 is set to 0.7 and 1.0: r2=0.7 is simulated for disease loci ungenotyped, but their LD markers genotyped; r2=1.0 is simulated for directly genotyped disease loci. We use the same simulation program as [4] to simulate 100 datasets under each parameter setting for each disease model. Each dataset contains 100 SNPs, and the sample size is fixed to 1000, 2000 and 4000.Figure 2 reveals the performance of different approaches on these three models. The power of all methods improves significantly when the sample size increasing from 2000 to 4000, and r2 changing from 0.7 to 1. However, the power of most approaches decreases as the MAFs of disease associated markers varying from 0.2 to 0.4. The trend is consistent with the results in [4, 27].
Fig. 2
Powers of different approaches on three two-locus disease models (Models 1–3) with 100 SNPs. Powers of DCHE, DECMDR, EDCF, HiSSI and HiSSI-BC on three two-locus disease models with different minor allele frequency (MAF), sample size (N) and linkage disequilibrium (LD); HiSSI-BC is a variant of HiSSI that uses the Bonferroni correction to obtain the corrected significant threshold. N0 is the number of controls, N1 is the number of cases, and M is the number of SNPs. The absence of a bar indicates no power. (a) Model1; (b) Model2; (c) Model3
Powers of different approaches on three two-locus disease models (Models 1–3) with 100 SNPs. Powers of DCHE, DECMDR, EDCF, HiSSI and HiSSI-BC on three two-locus disease models with different minor allele frequency (MAF), sample size (N) and linkage disequilibrium (LD); HiSSI-BC is a variant of HiSSI that uses the Bonferroni correction to obtain the corrected significant threshold. N0 is the number of controls, N1 is the number of cases, and M is the number of SNPs. The absence of a bar indicates no power. (a) Model1; (b) Model2; (c) Model3Across all models, HiSSI outperforms HiSSI-BC, which evidences that the adopted permutation test is more effective than Bonferroni correction for controlling false positives in multiple hypothesis test. In addition, HiSSI also has a better performance than other approaches (EDCF, DCHE and DECMDR) for Model1–Model3 except some cases, Model1 with N = 2000, r2 = 0.7, MAF = 0.4, and Model2 with r2 = 1.0, MAF = 0.4. In these cases, HiSSI has a lower power than EDCF and DCHE. That is because HiSSI may lose some genetic associations, since it partitions two-locus genotype combinations into two groups, which is much smaller than the number of genotypes. On the contrary, EDCF and DCHE partition genotype combinations into more groups than HiSSI; EDCF has three groups, DCHE has three to six groups; whose numbers are larger than two and can retain more genetic information. In most cases, DECMDR has the lowest power, since it applies heuristic search and only reports the optimal solution. Another interesting observation is that the power of EDCF drastically decreases when N = 4000 with r2 = 0.7 and MAF = 0.2. One possible reason is that EDCF divides each three-locus combinations into three groups and uses the chi-square test with two degrees of freedom to measure the significance, resulting in more false positives.In addition, high-dimensional simulation datasets with 1000 SNPs, 2000 and 4000 samples on Model2 are also used to test HiSSI and other comparing approaches. The settings of MAF and LD are the same as before and the simulation datasets are also generated by the same simulation program as Zhang et al. [4].Figure 3 reveals the performance of different approaches on Model2 with 1000 SNPs. Similarly, the power of all approaches significantly increases when the sample size increase from 2000 to 4000, r2 varies from 0.7 to 1; and decreases when MAF varies from 0.2 to 0.4. For the model with 1000 SNPs, HiSSI still outperforms HiSSI-BC, which confirms the effectiveness of permutation test on high-dimensional datasets. HiSSI has a better performance than other approaches except r2 = 1.0, MAF = 0.4. In such case, HiSSI has a lower power than EDCF and DCHE. EDCF loses its power when N = 4000 with r2 = 0.7 and MAF = 0.2. All these results are consistent with the results on the small simulation datasets with Model2.
Fig. 3
Powers of different approaches on Model 2 with 1000 SNPs. Powers of DCHE, DECMDR, EDCF, HiSSI and HiSSI-BC on Model2 under different minor allele frequency (MAF) and linkage disequilibrium (LD) with 1000 SNPs, 2000 and 4000 samples; HiSSI-BC is a variant of HiSSI which uses the Bonferroni correction to obtain the corrected significant threshold. N0 is the number of controls, N1 is the number of cases, and M is the number of SNPs. The absence of a bar indicates no power
Powers of different approaches on Model 2 with 1000 SNPs. Powers of DCHE, DECMDR, EDCF, HiSSI and HiSSI-BC on Model2 under different minor allele frequency (MAF) and linkage disequilibrium (LD) with 1000 SNPs, 2000 and 4000 samples; HiSSI-BC is a variant of HiSSI which uses the Bonferroni correction to obtain the corrected significant threshold. N0 is the number of controls, N1 is the number of cases, and M is the number of SNPs. The absence of a bar indicates no power
Three-locus disease models
We use two three-locus disease models (Model4 and Model5) to test the ability of HiSSI in detecting high-order SNP interactions. Model4 is a three-locus interaction model proposed by Zhang et al. [4]. Model5 is the extension of Model1, which is a two-locus interaction model with a threshold effect. The sample size increases from 2000 to 4000; the minor allele frequencies (MAFs) is set to 0.1, 0.2, and 0.4; the r2 changes from 0.7 to 1.0; and the marginal effect is set to λ=0.3 for Model4 and Model5. We use the same simulation program in Zhang et al. [4] to simulate 100 datasets under each parameter setting for each disease model, and each dataset contains 100 SNPs.Figure 4 shows the performance of different approaches on two three-locus disease models for high-order interactions detection. The power of all approaches significantly improves with the sample size increasing from 2000 to 4000, and r2 changing from 0.7 to 1. Besides, for Model5, the power of most approaches decreases with MAFs of the disease associated markers varying from 0.2 to 0.4. This trend is consistent with the results in two-locus disease model. However, the trend is not obvious for Model4 with MAF varying from 0.1 to 0.4.
Fig. 4
Powers of different approaches on two three-locus disease models (Models 4–5) with 100 SNPs. Powers of DCHE, DECMDR, EDCF, HiSSI and HiSSI-BC on two three-locus disease models with different minor allele frequency (MAF), sample size (N) and linkage disequilibrium (LD); HiSSI-BC is a variant of HiSSI that uses the Bonferroni correction to obtain the corrected significant threshold. N0 is the number of controls, N1 is the number of cases, and M is the number of SNPs. The absence of a bar indicates no power. (a) Model4; (b) Model5
Powers of different approaches on two three-locus disease models (Models 4–5) with 100 SNPs. Powers of DCHE, DECMDR, EDCF, HiSSI and HiSSI-BC on two three-locus disease models with different minor allele frequency (MAF), sample size (N) and linkage disequilibrium (LD); HiSSI-BC is a variant of HiSSI that uses the Bonferroni correction to obtain the corrected significant threshold. N0 is the number of controls, N1 is the number of cases, and M is the number of SNPs. The absence of a bar indicates no power. (a) Model4; (b) Model5For these two models, HiSSI again has a better performance than HiSSI-BC, which shows that permutation test is also more effective than Bonferroni correction in detecting high-order interactions. In addition, HiSSI obtains the highest power for Model4–Model5 except some cases, Model4 with N = 2000, r2 = 1.0, MAF = 0.2, 0.4; and N = 4000, MAF = 0.4. In these cases, HiSSI has a lower power than EDCF and DCHE. That is due to the same reason as two-locus disease models, resulting more false positives in HiSSI. For both models, the power of EDCF still drastically decreases when N = 4000 with r2 = 0.7 and MAF = 0.2, that is consistent with the results in two-locus disease model. Since EDCF, DCHE and HiSSI-BC all employ corrected Bonferroni correction to calculate the threshold, from the power between HiSSI between these methods, we can conclude that permutation test is more effective than Bonferroni correction for controlling false positives in multiple hypothesis test. In most cases, DECMDR has the lowest power, since it applies heuristic search in a larger search space and only reports the optimal solution.Besides, high-dimensional simulation datasets with 1000 SNPs, 2000 and 4000 samples on Model4 and Model5 are also used to test HiSSI and other comparing approaches. The settings about MAF and LD are the same as the simulation datasets with 100 SNPs. Figure 5 reveals the performance of different approaches on Model4 and Model5 with 1000 SNPs. The trend for power of all approaches is consistent with that on the small simulation datasets. For both the two models, HiSSI still has a better power than HiSSI-BC; and HiSSI obtains the highest power except Model4 with MAF = 0.4, and N = 2000, r2 = 1.0, MAF = 0.2. In these cases, the power of HiSSI is lower than EDCF and DCHE. All these results are consistent with the results on small simulation datasets.
Fig. 5
Powers of different approaches on two three-locus disease models (Models 4–5) with 1000 SNPs. Powers of DCHE, DECMDR, EDCF, HiSSI and HiSSI-BC on two three-locus models under different minor allele frequency (MAF) and linkage disequilibrium (LD) with 1000 SNPs, 2000 and 4000 samples; HiSSI-BC is a variant of HiSSI that uses the Bonferroni correction to obtain the corrected significant threshold. N0 is the number of controls, N1 is the number of cases, and M is the number of SNPs. The absence of a bar indicates no power. (a) Model4; (b) Model5
Powers of different approaches on two three-locus disease models (Models 4–5) with 1000 SNPs. Powers of DCHE, DECMDR, EDCF, HiSSI and HiSSI-BC on two three-locus models under different minor allele frequency (MAF) and linkage disequilibrium (LD) with 1000 SNPs, 2000 and 4000 samples; HiSSI-BC is a variant of HiSSI that uses the Bonferroni correction to obtain the corrected significant threshold. N0 is the number of controls, N1 is the number of cases, and M is the number of SNPs. The absence of a bar indicates no power. (a) Model4; (b) Model5In addition, we also conduct experiments on two models (two-locus and three-locus models) without marginal effect with 100 and 1000 SNPs. The experimental settings, results and analysis can be found in Additional file 1. All simulated models used in simulated experiments are showed in Additional file 2.
Experiment on the breast cancer dataset
A real breast cancer dataset (BC) collected from WTCCC project [26] is used to further evaluate HiSSI. It is reported that breast cancer is caused by a combination of genetic and environmental risk factors [28]. The BC dataset contains 15347 SNPs from 1045 affected individuals and 3893 normal individuals. Quality control is performed to exclude very low rate samples and SNPs. For a SNP, if its call rate <95% across all samples, or its p-value (Hardy-Weinberg equilibrium) <0.0001 in controls, or with MAF <0.1, the SNP will be excluded. For a sample, if its call rate <98%, the sample will be excluded. Through the quality control, the BC dataset contains 1045 case samples and 3893 control samples with 5607 SNPs.Some significant two-locus and three-locus combinations on BC dataset identified by HiSSI is shown in Table 2. In the two-locus combinations, (rs1108842, rs2289247) is in gene GNL3 on chromosome 3. The protein encoded by GNL3 may interact with p53 and may be involved in tumorigenesis. (rs2242665, rs2856705) is on chromosome 6, where rs2856705 is susceptibly associated with breast cancer [29]. (rs1801197, rs6971091) is on chromosome 7, where rs1801197 is located in gene CALCR. It is evidenced that rs1801197/CALCR can lead to breast cancer [29]. (rs365990, rs7158731) is on chromosome 14, where rs365990 is in gene MYH6, and rs7158731 is in gene ZNF839. MYH6 encodes the alpha heavy chain subunit of cardiac myosin, and mutations in this gene cause familial hypertrophic cardiomyopathy and atrial septal defect 3. It is reported that MYH6 and ZNF839 are associated with breast cancer [29]. (rs8059973, rs3785181) is on chromosome 16, where rs8059973 is in gene DBNDD1, rs3785181 is in gene GAS11. rs8059973/DBNDD1 is associated with breast cancer [29]. GAS11 includes 11 exons spanning 25 kb and maps to a region of chromosome 16 that is sometimes deleted in breast and prostrate cancer. This gene is a putative tumor suppressor gene and is reported as being associated with breast cancer [30]. (rs2822558, rs2822787) is on chromosome 21, where rs2822558 is located in gene ABCC13. ABCC13 is a member of the superfamily of genes encoding ATP-binding cassette (ABC) transporters. It is reported that rs2822558/ABCC13 is related to breast cancer [29].
Table 2
Significant two-locus and three-locus combinations identified by HiSSI on WTCCC BC data
Significant Interaction
Chromosome and Related Genes
Single-Locusp-Value
Interactionp-Value
(rs1108842, rs2289247)
(chr3: GNL3, chr3: GNL3)
(1.139×10−2,5.981×10−1)
6.048×10−45
(rs1130643, rs10017772)
(chr4: SPARCL1, chr4: DCHS2)
(2.823×10−1,3.830×10−1)
2.865×10−8
(rs3761967, rs715748)
(chr5: BDP1, chr5: BDP1)
(1.720×10−1,4.073×10−1)
1.238×10−54
(rs2242665, rs2856705)
(chr6: SLC44A4, chr6: *)
(3.484×10−4,6.178×10−8)
4.369×10−25
(rs1801197, rs6971091)
(chr7: CALCR, chr7: FAM71F1)
(4.983×10−2,6.728×10−1)
2.504×10−10
(rs365990, rs7158731)
(chr14: MYH6, chr14: ZNF839)
(8.807×10−3,3.319×10−3)
1.636×10−6
(rs8059973, rs3785181)
(chr16: DBNDD1, chr16: GAS11)
(3.019×10−4,1.464×10−3)
4.050×10−7
(rs2822558, rs2822787)
(chr21: ABCC13, chr21: SAMSN1-AS1)
(6.630×10−3,1.662×10−1)
4.207×10−21
(rs879882, rs2523608, rs805262)
(chr6: POU5F1, chr6: HLA-B, chr6: BAG6)
(2.711×10−1,7.096×10−1,5.836×10−4)
1.030×10−11
*Indicates that the related gene is unknown.
Significant two-locus and three-locus combinations identified by HiSSI on WTCCC BC data*Indicates that the related gene is unknown.In the three-locus combination, (rs879882, rs2523608, rs805262) is on chromosome 6. rs879882 is in gene POU5F1, which plays a key role in embryonic development and stem cell pluripotency [31]. rs2523608 is located at gene HLA-B and belongs to human leukocyte antigen (HLA) class I heavy chain paralogs. HLA class I antigen expression plays a central role in the immune system and is closely related to the aggressiveness and prognosis of BC [32]. rs805262 belongs to gene BAG6, which was first characterized as part of a cluster of genes located within the human major histocompatibility complex class III region. In addition, BAG6 is implicated in the control of apoptosis and is associated with basal cell carcinoma [33]. These identified significant two-locus and three-locus combinations demonstrate that HiSSI is capable to detect SNP interactions on genome-wide data.
Parameter setting
In the screening stage, we set J=100 (number of permutations), α=0.05 (target FWER).In the search stage, there are four common parameters of DE algorithm: population size (ps), generation size (g), the scaling factor (F) and crossover constant (CR). We set these parameters according to previous studies [10, 34] For real dataset, we set: ps=500, g=500, F=0.5 and CR=0.5.
Discussion
Comparison between HiSSI and other approaches
Comparison between HiSSI and HiSSI-BC: HISSI-BC is a variant of HiSSI, the main difference between HiSSI and HiSSI-BC is that HiSSI employs a fast permutation test to obtain corrected significant threshold, while HiSSI-BC uses the Bonferroni correction. For all simulation datasets on different disease models (including two-locus and three-locus), HiSSI always outperforms HiSSI-BC, which demonstrates that permutation test is more effective than Bonferroni correction in correcting multiple testing.Comparison between HiSSI and EDCF, DCHE: HiSSI utilizes a statistically significant pattern combined with permutation test to partition genotype combinations into two subgroups, which considers FWER to control false positives; while EDCF partitions genotype combinations into three subgroups, and DCHE dynamically partitions genotype combinations into three to six subgroups. Moreover, both EDCF and DCHE utilize the Bonferroni correction to correct multiple testing. The results on simulation datasets reveals HiSSI has a better performance than EDCF and DCHE, which proves the effectiveness of significant pattern in controlling false positives.Comparison between HiSSI and DECMDR: both DECMDR and HiSSI utilize differential evolution (DE) algorithm to identify SNP interactions. DECMDR utilizes DE algorithm in the whole search space and uses the classification based multifactor-dimensionality reduction (CMDR) as a fitness measure to evaluate values of solutions in the DE process. While HiSSI utilizes DE algorithm in a smaller search space based on candidate set and the chi-square test as the fitness measure in DE process, it has a higher probability to search the true interactions. Since MDR is time-consuming and only reports the optimal solution, DECMDR has a lower power than other approaches in most cases.
The advantages and limitations of HiSSI
The development of HiSSI is to overcome of the limitations of existing approaches on detecting high-order SNP interactions from genome-wide data. HiSSI displays several advantages over existing methods:HiSSI applies a FWER-constrained statistically significant pattern to strictly control false positives in multiple hypothesis test.HiSSI utilizes a fast permutation testing to obtain corrected significant threshold, which avoids analyzing all two-locus combinations, greatly reduces the total runtime; and also avoids the conservatism of Bonferroni correction.HiSSI provides two alternative search strategies, exhaustive search and heuristic search for different sizes of GWAS datasets.The running time of HiSSI is relatively long compared with other approaches. It is a general problem for existing approaches that employ permutation test. Although HiSSI utilizes a fast permutation test, which is faster than traditional permutation test, it is still time-consuming compared with heuristic algorithms and those approaches with Bonferroni correction. In addition, HiSSI does not directly control the main effects, which may introduce the negative influence of main effects for pairwise SNP combinations; and HiSSI only partitions genotype combinations into two groups, which may lose some genetic association. These limitations may degrade the performance of HiSSI. Future work can be extended to address the above limitations.
Conclusions
Detecting potential SNP-SNP interactions in GWAS is an indispensable and challenging problem. In this paper, we proposed a two-stage method called HiSSI to solve the problem. In the screening stage, HiSSI controls the false positives using an efficient statistically significant pattern that considers the family wise error rate, and obtains significant candidate pairwise SNP combinations. In the search stage, HiSSI utilizes two different strategies, exhaustive search and DE-based search, to detect high-order SNP interactions. Exhaustive search is applied to a small candidate set, and DE-based search is used for a large candidate set. A series of simulation experiments on both two-locus and three-locus disease models show that HiSSI is more powerful than other related approaches in detecting SNP interactions. Further experiment on a real WTCCC dataset corroborates that HiSSI is capable to identify high-order SNP interactions from genome-wide data.Additional file 1 Experiments on models without marginal effect. Two disease models (a two-locus and a three-locus models) without marginal effect are used to test the performances of different approaches under different parameter settings.Additional file 2 Simulated disease models. Simulated two-locus and three-locus models used in the simulation experiments are listed in tables.
Authors: Xiang Wan; Can Yang; Qiang Yang; Hong Xue; Xiaodan Fan; Nelson L S Tang; Weichuan Yu Journal: Am J Hum Genet Date: 2010-09-10 Impact factor: 11.025
Authors: Paul R Burton; David G Clayton; Lon R Cardon; Nick Craddock; Panos Deloukas; Audrey Duncanson; Dominic P Kwiatkowski; Mark I McCarthy; Willem H Ouwehand; Nilesh J Samani; John A Todd; Peter Donnelly; Jeffrey C Barrett; Dan Davison; Doug Easton; David M Evans; Hin-Tak Leung; Jonathan L Marchini; Andrew P Morris; Chris C A Spencer; Martin D Tobin; Antony P Attwood; James P Boorman; Barbara Cant; Ursula Everson; Judith M Hussey; Jennifer D Jolley; Alexandra S Knight; Kerstin Koch; Elizabeth Meech; Sarah Nutland; Christopher V Prowse; Helen E Stevens; Niall C Taylor; Graham R Walters; Neil M Walker; Nicholas A Watkins; Thilo Winzer; Richard W Jones; Wendy L McArdle; Susan M Ring; David P Strachan; Marcus Pembrey; Gerome Breen; David St Clair; Sian Caesar; Katharine Gordon-Smith; Lisa Jones; Christine Fraser; Elaine K Green; Detelina Grozeva; Marian L Hamshere; Peter A Holmans; Ian R Jones; George Kirov; Valentina Moskivina; Ivan Nikolov; Michael C O'Donovan; Michael J Owen; David A Collier; Amanda Elkin; Anne Farmer; Richard Williamson; Peter McGuffin; Allan H Young; I Nicol Ferrier; Stephen G Ball; Anthony J Balmforth; Jennifer H Barrett; Timothy D Bishop; Mark M Iles; Azhar Maqbool; Nadira Yuldasheva; Alistair S Hall; Peter S Braund; Richard J Dixon; Massimo Mangino; Suzanne Stevens; John R Thompson; Francesca Bredin; Mark Tremelling; Miles Parkes; Hazel Drummond; Charles W Lees; Elaine R Nimmo; Jack Satsangi; Sheila A Fisher; Alastair Forbes; Cathryn M Lewis; Clive M Onnie; Natalie J Prescott; Jeremy Sanderson; Christopher G Matthew; Jamie Barbour; M Khalid Mohiuddin; Catherine E Todhunter; John C Mansfield; Tariq Ahmad; Fraser R Cummings; Derek P Jewell; John Webster; Morris J Brown; Mark G Lathrop; John Connell; Anna Dominiczak; Carolina A Braga Marcano; Beverley Burke; Richard Dobson; Johannie Gungadoo; Kate L Lee; Patricia B Munroe; Stephen J Newhouse; Abiodun Onipinla; Chris Wallace; Mingzhan Xue; Mark Caulfield; Martin Farrall; Anne Barton; Ian N Bruce; Hannah Donovan; Steve Eyre; Paul D Gilbert; Samantha L Hilder; Anne M Hinks; Sally L John; Catherine Potter; Alan J Silman; Deborah P M Symmons; Wendy Thomson; Jane Worthington; David B Dunger; Barry Widmer; Timothy M Frayling; Rachel M Freathy; Hana Lango; John R B Perry; Beverley M Shields; Michael N Weedon; Andrew T Hattersley; Graham A Hitman; Mark Walker; Kate S Elliott; Christopher J Groves; Cecilia M Lindgren; Nigel W Rayner; Nicolas J Timpson; Eleftheria Zeggini; Melanie Newport; Giorgio Sirugo; Emily Lyons; Fredrik Vannberg; Adrian V S Hill; Linda A Bradbury; Claire Farrar; Jennifer J Pointon; Paul Wordsworth; Matthew A Brown; Jayne A Franklyn; Joanne M Heward; Matthew J Simmonds; Stephen C L Gough; Sheila Seal; Michael R Stratton; Nazneen Rahman; Maria Ban; An Goris; Stephen J Sawcer; Alastair Compston; David Conway; Muminatou Jallow; Melanie Newport; Giorgio Sirugo; Kirk A Rockett; Suzannah J Bumpstead; Amy Chaney; Kate Downes; Mohammed J R Ghori; Rhian Gwilliam; Sarah E Hunt; Michael Inouye; Andrew Keniry; Emma King; Ralph McGinnis; Simon Potter; Rathi Ravindrarajah; Pamela Whittaker; Claire Widden; David Withers; Niall J Cardin; Dan Davison; Teresa Ferreira; Joanne Pereira-Gale; Ingeleif B Hallgrimsdo'ttir; Bryan N Howie; Zhan Su; Yik Ying Teo; Damjan Vukcevic; David Bentley; Matthew A Brown; Alastair Compston; Martin Farrall; Alistair S Hall; Andrew T Hattersley; Adrian V S Hill; Miles Parkes; Marcus Pembrey; Michael R Stratton; Sarah L Mitchell; Paul R Newby; Oliver J Brand; Jackie Carr-Smith; Simon H S Pearce; R McGinnis; A Keniry; P Deloukas; John D Reveille; Xiaodong Zhou; Anne-Marie Sims; Alison Dowling; Jacqueline Taylor; Tracy Doan; John C Davis; Laurie Savage; Michael M Ward; Thomas L Learch; Michael H Weisman; Mathew Brown Journal: Nat Genet Date: 2007-10-21 Impact factor: 38.330
Authors: Kyriaki Michailidou; Per Hall; Anna Gonzalez-Neira; Maya Ghoussaini; Joe Dennis; Roger L Milne; Marjanka K Schmidt; Jenny Chang-Claude; Stig E Bojesen; Manjeet K Bolla; Qin Wang; Ed Dicks; Andrew Lee; Clare Turnbull; Nazneen Rahman; Olivia Fletcher; Julian Peto; Lorna Gibson; Isabel Dos Santos Silva; Heli Nevanlinna; Taru A Muranen; Kristiina Aittomäki; Carl Blomqvist; Kamila Czene; Astrid Irwanto; Jianjun Liu; Quinten Waisfisz; Hanne Meijers-Heijboer; Muriel Adank; Rob B van der Luijt; Rebecca Hein; Norbert Dahmen; Lars Beckman; Alfons Meindl; Rita K Schmutzler; Bertram Müller-Myhsok; Peter Lichtner; John L Hopper; Melissa C Southey; Enes Makalic; Daniel F Schmidt; Andre G Uitterlinden; Albert Hofman; David J Hunter; Stephen J Chanock; Daniel Vincent; François Bacot; Daniel C Tessier; Sander Canisius; Lodewyk F A Wessels; Christopher A Haiman; Mitul Shah; Robert Luben; Judith Brown; Craig Luccarini; Nils Schoof; Keith Humphreys; Jingmei Li; Børge G Nordestgaard; Sune F Nielsen; Henrik Flyger; Fergus J Couch; Xianshu Wang; Celine Vachon; Kristen N Stevens; Diether Lambrechts; Matthieu Moisse; Robert Paridaens; Marie-Rose Christiaens; Anja Rudolph; Stefan Nickels; Dieter Flesch-Janys; Nichola Johnson; Zoe Aitken; Kirsimari Aaltonen; Tuomas Heikkinen; Annegien Broeks; Laura J Van't Veer; C Ellen van der Schoot; Pascal Guénel; Thérèse Truong; Pierre Laurent-Puig; Florence Menegaux; Frederik Marme; Andreas Schneeweiss; Christof Sohn; Barbara Burwinkel; M Pilar Zamora; Jose Ignacio Arias Perez; Guillermo Pita; M Rosario Alonso; Angela Cox; Ian W Brock; Simon S Cross; Malcolm W R Reed; Elinor J Sawyer; Ian Tomlinson; Michael J Kerin; Nicola Miller; Brian E Henderson; Fredrick Schumacher; Loic Le Marchand; Irene L Andrulis; Julia A Knight; Gord Glendon; Anna Marie Mulligan; Annika Lindblom; Sara Margolin; Maartje J Hooning; Antoinette Hollestelle; Ans M W van den Ouweland; Agnes Jager; Quang M Bui; Jennifer Stone; Gillian S Dite; Carmel Apicella; Helen Tsimiklis; Graham G Giles; Gianluca Severi; Laura Baglietto; Peter A Fasching; Lothar Haeberle; Arif B Ekici; Matthias W Beckmann; Hermann Brenner; Heiko Müller; Volker Arndt; Christa Stegmaier; Anthony Swerdlow; Alan Ashworth; Nick Orr; Michael Jones; Jonine Figueroa; Jolanta Lissowska; Louise Brinton; Mark S Goldberg; France Labrèche; Martine Dumont; Robert Winqvist; Katri Pylkäs; Arja Jukkola-Vuorinen; Mervi Grip; Hiltrud Brauch; Ute Hamann; Thomas Brüning; Paolo Radice; Paolo Peterlongo; Siranoush Manoukian; Bernardo Bonanni; Peter Devilee; Rob A E M Tollenaar; Caroline Seynaeve; Christi J van Asperen; Anna Jakubowska; Jan Lubinski; Katarzyna Jaworska; Katarzyna Durda; Arto Mannermaa; Vesa Kataja; Veli-Matti Kosma; Jaana M Hartikainen; Natalia V Bogdanova; Natalia N Antonenkova; Thilo Dörk; Vessela N Kristensen; Hoda Anton-Culver; Susan Slager; Amanda E Toland; Stephen Edge; Florentia Fostira; Daehee Kang; Keun-Young Yoo; Dong-Young Noh; Keitaro Matsuo; Hidemi Ito; Hiroji Iwata; Aiko Sueta; Anna H Wu; Chiu-Chen Tseng; David Van Den Berg; Daniel O Stram; Xiao-Ou Shu; Wei Lu; Yu-Tang Gao; Hui Cai; Soo Hwang Teo; Cheng Har Yip; Sze Yee Phuah; Belinda K Cornes; Mikael Hartman; Hui Miao; Wei Yen Lim; Jen-Hwei Sng; Kenneth Muir; Artitaya Lophatananon; Sarah Stewart-Brown; Pornthep Siriwanarangsan; Chen-Yang Shen; Chia-Ni Hsiung; Pei-Ei Wu; Shian-Ling Ding; Suleeporn Sangrajrang; Valerie Gaborieau; Paul Brennan; James McKay; William J Blot; Lisa B Signorello; Qiuyin Cai; Wei Zheng; Sandra Deming-Halverson; Martha Shrubsole; Jirong Long; Jacques Simard; Montse Garcia-Closas; Paul D P Pharoah; Georgia Chenevix-Trench; Alison M Dunning; Javier Benitez; Douglas F Easton Journal: Nat Genet Date: 2013-04 Impact factor: 38.330
Authors: Anatoliy I Yashin; Deqing Wu; Konstantin Arbeev; Arseniy P Yashkin; Igor Akushevich; Olivia Bagley; Matt Duan; Svetlana Ukraintseva Journal: J Transl Genet Genom Date: 2021-10-19