Xia Cao, Guoxian Yu, Jie Liu, Lianyin Jia, Jun Wang.
Abstract
Identifying single nucleotide polymorphism (SNP) interactions is a popular and crucial way of explaining the missing heritability of complex diseases in genome-wide association studies (GWAS). Many approaches have been proposed to detect SNP interactions. However, existing approaches generally suffer from high computational complexity, which results from the explosion in the number of candidate high-order interactions. In this paper, we propose a two-stage approach, called ClusterMI, to detect high-order genome-wide SNP interactions based on significant pairwise SNP combinations. In the screening stage, to alleviate the huge computational burden, ClusterMI first applies a clustering algorithm combined with mutual information to divide SNPs into different clusters. ClusterMI then uses conditional mutual information to screen significant pairwise SNP combinations in each cluster. In this way, significant two-locus combinations are more likely to be identified within each cluster, and the computational load of the follow-up search is greatly reduced. In the search stage, two search strategies (exhaustive search and improved ant colony optimization search) are provided to detect high-order SNP interactions; the strategy is chosen according to the cardinality of the set of significant two-locus combinations. Extensive simulation experiments show that ClusterMI performs better than other related and competitive approaches. Experiments on two real case-control datasets from the Wellcome Trust Case Control Consortium (WTCCC) also demonstrate that ClusterMI is more capable of identifying high-order SNP interactions from genome-wide data.
Keywords: clustering; genome-wide association studies; high-order SNP interactions; improved ant colony optimization; mutual information
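The screening stage described in the abstract relies on mutual information (MI) and conditional mutual information (cMI). The following is a minimal sketch of how such quantities can be computed from discrete genotype data, not the authors' implementation. It assumes genotypes coded 0/1/2 and a binary case-control phenotype, and it assumes one plausible form of the screening score, I(A; Y | B) for SNPs A and B and phenotype Y; the abstract does not give the exact definition. The function names and the `screen_pairs` helper are hypothetical, and the threshold `tau` stands in for the cMI threshold τ mentioned in Figure 2.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more equally long discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def screen_pairs(snps, phenotype, tau):
    """Keep SNP pairs whose assumed score I(A; Y | B) exceeds tau.

    `snps` maps a SNP name to its genotype column. This is an
    illustrative (hypothetical) screening rule, not the paper's exact one.
    """
    kept = []
    for a, b in combinations(sorted(snps), 2):
        score = conditional_mutual_information(snps[a], phenotype, snps[b])
        if score > tau:
            kept.append((a, b, score))
    return kept


# Toy data: the phenotype is the XOR of snp1 and snp2, so neither SNP is
# informative alone, but each is informative once the other is known.
toy_snps = {
    "snp1": [0, 0, 1, 1],
    "snp2": [0, 1, 0, 1],
    "snp3": [0, 0, 0, 0],
}
toy_phenotype = [0, 1, 1, 0]
toy_hits = screen_pairs(toy_snps, toy_phenotype, tau=0.5)
```

On this toy data the single-SNP MI between `snp1` and the phenotype is zero, while the conditional score for the pair (`snp1`, `snp2`) is one bit, which illustrates why a pairwise screen can catch interactions that single-locus tests miss.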
Year: 2018 PMID: 30072632 PMCID: PMC6121365 DOI: 10.3390/ijms19082267
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Power and runtime of ClusterMI under different numbers of clusters (k). (a) Power of ClusterMI for different k; (b) runtime of ClusterMI for different k.
Figure 2. Power and runtime of ClusterMI under different cMI thresholds (τ). (a) Power of ClusterMI for different τ; (b) runtime of ClusterMI for different τ.
Figure 3. Power of dynamic clustering for high-order genome-wide epistatic interactions detecting (DCHE), the differential evolution algorithm combined with classification-based multifactor dimensionality reduction (DECMDR), the epistasis detector based on the clustering of relatively frequent items (EDCF), HiSeeker(A), HiSeeker(E), ClusterMI(A) and ClusterMI(E) on five three-locus disease models under different minor allele frequencies (MAF), sample sizes (N) and linkage disequilibrium (LD) levels. N0 is the number of controls, N1 the number of cases and M the number of SNPs. The absence of a bar indicates no power. A: ACO search strategy; E: exhaustive search strategy. (a) Model 1; (b) Model 2; (c) Model 3; (d) Model 4; (e) Model 5.
Figure 4. Power and runtime of different approaches on Model 5 under different minor allele frequencies (MAF) and linkage disequilibrium (LD) levels with 8000 samples and 3000 SNPs. (a) Power; (b) runtime.
Significant two-locus and three-locus combinations identified by ClusterMI on the Wellcome Trust Case Control Consortium (WTCCC) breast cancer (BC) data.
| Chromosome | SNP Combinations | Related Genes | Single-Locus p-Values | Combination p-Value |
|---|---|---|---|---|
| chr3 | (rs13100173, rs1108842) | (HYAL3, GNL3) | | |
| chr6 | (rs9257694, rs879882) | (LOC105375005, POU5F1) | | |
| chr6 | (rs3094576, rs644827) | (*, SLC44A4) | | |
| chr16 | (rs17822931, rs3785181) | (ABCC11, GAS11) | | |
| chr16 | (rs7190823, rs4408545) | (FANCA, AFG3L1P) | | |
| chr6 | (rs9257694, rs2523608, rs11244) | (LOC105375005, HLA-B, HLA-DOB) | | |
* Indicates that the related gene is unknown. p-value is estimated by the chi-square test.
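The footnote states that p-values are estimated by the chi-square test. As a hedged illustration of what such a test involves, the sketch below computes the Pearson chi-square statistic and its degrees of freedom for a contingency table of genotype combinations against case-control status. Building the table from joint genotype combinations is an assumption about how the combination p-value is obtained, and the function name `chi_square_statistic` is hypothetical.

```python
from collections import Counter


def chi_square_statistic(genotype_combos, labels):
    """Pearson chi-square statistic and degrees of freedom for the
    contingency table (genotype combination x case/control label)."""
    n = len(labels)
    observed = Counter(zip(genotype_combos, labels))
    row_totals = Counter(genotype_combos)
    col_totals = Counter(labels)

    stat = 0.0
    for combo, row_n in row_totals.items():
        for label, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected under independence
            stat += (observed[(combo, label)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Toy two-locus example: each sample's combination is the pair of its
# genotypes at two SNPs; label 1 marks a case and 0 a control.
snp_a = [0] * 10 + [1] * 10
snp_b = [0] * 10 + [1] * 10
status = [0] * 10 + [1] * 10
combos = list(zip(snp_a, snp_b))
toy_stat, toy_dof = chi_square_statistic(combos, status)
```

A p-value then follows from the upper tail of the chi-square distribution with `dof` degrees of freedom, for example via `scipy.stats.chi2.sf(stat, dof)` if SciPy is available.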
Significant two-locus and three-locus combinations identified by ClusterMI on the WTCCC celiac disease (CD) data.
| Chromosome | SNP Combinations | Related Genes | Single-Locus p-Values | Combination p-Value |
|---|---|---|---|---|
| chr1 | (rs3748816, rs3795263) | (MMEL1, ACTRT2) | | |
| chr2 | (rs3816281, rs4973588) | (PLEK, NGEF) | | |
| chr6 | (rs3823418, rs4151664) | (PSORS1C1, NELFE) | | |
| chr6 | (rs2021723, rs3093662) | (TRIM40, TNF) | | |
| chr22 | (rs2298428, rs1321, rs5771069) | (YDJC, ALG12, IL17REL) | | |
p-value was estimated by the chi-square test.
Figure 5. Runtime of different approaches on the simulated datasets. (a) The sample size N varies from 1000 to 4000 with the number of SNPs M = 1000; (b) the number of SNPs M varies from 1000 to 4000 with a fixed sample size N.
Figure 6. Procedure overview of ClusterMI (Clustering combined with Mutual Information). SNP: single nucleotide polymorphism; ACO: ant colony optimization algorithm.