Damrongrit Setsirichok, Phuwadej Tienboon, Nattapong Jaroonruang, Somkit Kittichaijaroen, Waranyu Wongseree, Theera Piroonratana, Touchpong Usavanarong, Chanin Limwongse, Chatchawit Aporntewan, Marong Phadoongsidhi, Nachol Chaiyaratana.
Abstract
This article presents the ability of an omnibus permutation test on ensembles of two-locus analyses (2LOmb) to detect pure epistasis in the presence of genetic heterogeneity. The performance of 2LOmb is evaluated in various simulation scenarios covering two independent causes of complex disease where each cause is governed by a purely epistatic interaction. Different scenarios are set up by varying the number of available single nucleotide polymorphisms (SNPs) in the data, the number of causative SNPs and the ratio of case samples from two affected groups. The simulation results indicate that 2LOmb outperforms multifactor dimensionality reduction (MDR) and random forest (RF) techniques in terms of a low number of output SNPs and a high number of correctly-identified causative SNPs. Moreover, 2LOmb is capable of identifying the number of independent interactions in tractable computational time and can be used in genome-wide association studies. 2LOmb is subsequently applied to a type 1 diabetes mellitus (T1D) data set, which was collected from a UK population by the Wellcome Trust Case Control Consortium (WTCCC). After screening for SNPs that are located within or near genes and exhibit no marginal single-locus effects, the T1D data set is reduced to 95,991 SNPs from 12,146 genes. The 2LOmb search in the reduced T1D data set reveals that 12 SNPs, which can be divided into two independent sets, are associated with the disease. The first SNP set consists of three SNPs from MUC21 (mucin 21, cell surface associated), three SNPs from MUC22 (mucin 22), two SNPs from PSORS1C1 (psoriasis susceptibility 1 candidate 1) and one SNP from TCF19 (transcription factor 19). A four-locus interaction between these four genes is also detected. The second SNP set consists of three SNPs from ATAD1 (ATPase family, AAA domain containing 1). Overall, the findings indicate the detection of pure epistasis in the presence of genetic heterogeneity and provide an alternative explanation for the aetiology of T1D in the UK population.
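To make the idea of a permutation-based global p-value concrete, the following is a minimal sketch of a generic max-statistic permutation test over all SNP pairs. It is not the 2LOmb procedure itself: the use of a Pearson chi-square statistic, the choice of the best single pair as the omnibus statistic, and all function names (`chi2_two_locus`, `best_pair_statistic`, `global_p_value`, `toy_data`) are assumptions made for illustration only.

```python
import random
from itertools import combinations


def chi2_two_locus(g1, g2, status):
    """Pearson chi-square statistic of a two-locus genotype table versus status (0/1)."""
    cells = {}
    for a, b, s in zip(g1, g2, status):
        cells.setdefault((a, b), [0, 0])[s] += 1
    n = len(status)
    n_cases = sum(status)
    column_totals = (n - n_cases, n_cases)
    stat = 0.0
    for observed in cells.values():
        row_total = observed[0] + observed[1]
        for s in (0, 1):
            expected = row_total * column_totals[s] / n
            if expected > 0:
                stat += (observed[s] - expected) ** 2 / expected
    return stat


def best_pair_statistic(genotypes, status):
    """Largest two-locus statistic over all SNP pairs (the omnibus statistic here)."""
    return max(
        chi2_two_locus(genotypes[i], genotypes[j], status)
        for i, j in combinations(range(len(genotypes)), 2)
    )


def global_p_value(genotypes, status, n_perm=99, seed=0):
    """Permutation p-value of the best-pair statistic under shuffled status labels."""
    rng = random.Random(seed)
    observed = best_pair_statistic(genotypes, status)
    shuffled = list(status)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_pair_statistic(genotypes, shuffled) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (n_perm + 1)


def toy_data(n_samples=200, n_snps=4, seed=1):
    """Hypothetical toy data: status depends only on the joint genotype of SNPs 0 and 1."""
    rng = random.Random(seed)
    genotypes = [[rng.randrange(3) for _ in range(n_samples)] for _ in range(n_snps)]
    status = [1 if (a + b) % 2 == 0 else 0 for a, b in zip(genotypes[0], genotypes[1])]
    return genotypes, status
```

Calling `global_p_value(*toy_data())` returns a small p-value because the planted pair separates cases from controls completely. Shuffling the status labels keeps the genotype correlation structure intact, which is why a permutation null is a common choice when many correlated SNP pairs are tested at once.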
Keywords: Attribute selection; Complex disease; Epistasis; Genetic heterogeneity; Genome-wide association study; Pattern recognition; Permutation test; Single nucleotide polymorphism; Type 1 diabetes mellitus
Year: 2013 PMID: 24804170 PMCID: PMC4006521 DOI: 10.1186/2193-1801-2-230
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
Figure 1. Performance of MDR, RF and 2LOmb in the problem with 20 SNPs. The results are averaged over 30 independent simulations. MDR explores only models that do not contain more than 10 SNPs. The MDR output contains the most parsimonious SNP combination that yields the maximum prediction accuracy. The number of trees in RF is set to 100. The RF output consists of top-ranked SNPs, which are SNPs with variable importance in the top five percentiles of a normal distribution (Strobl et al. 2009). Association detection is declared for 2LOmb if the global p-value used as the detection indicator in its result is less than 0.05. The results from MDR, RF and 2LOmb are displayed using red diamond, blue triangle and black square markers, respectively. In each chart, the meeting point between two dotted lines denotes the graphical location representing ideal performance of the algorithm. Ideally, the algorithm should report only the causative SNPs in its output. In other words, both the number of output SNPs and the number of correctly-identified causative SNPs should be equal to the number of causative SNPs. The charts on which the red diamond markers are invisible denote the situations in which the performance of MDR and 2LOmb is similar.
Figure 2. Performance of RF and 2LOmb in the problem with 1,000 SNPs. The number of trees in RF is set to 1,000. The explanation for how the results are obtained and displayed is the same as that given in Figure 1. The charts in this figure are displayed using a coarser scale than the charts in Figure 1.
Computational time required by MDR, RF and 2LOmb to analyse small-scaled simulated data sets with different numbers of available SNPs, different numbers of causative SNPs and different ratios of case samples from two affected groups
| Number of causative SNPs | Ratio of case samples | | | | | |
|---|---|---|---|---|---|---|
| 2&2 | 1:3 | 6,505 | 2 | 539 | 5 | 24 |
| | 1:1 | 6,434 | 2 | 529 | 6 | 23 |
| 3&3 | 1:3 | 6,573 | 2 | 529 | 13 | 32 |
| | 1:1 | 6,611 | 2 | 531 | 14 | 32 |
| 4&4 | 1:3 | 6,372 | 2 | 534 | 32 | 45 |
| | 1:1 | 6,528 | 2 | 538 | 27 | 46 |
| 2&3 | 1:3 | 6,637 | 2 | 529 | 12 | 32 |
| | 1:1 | 6,644 | 3 | 527 | 10 | 30 |
| | 3:1 | 6,776 | 2 | 528 | 10 | 28 |
| 2&4 | 1:3 | 6,513 | 2 | 525 | 16 | 35 |
| | 1:1 | 6,637 | 2 | 528 | 16 | 35 |
| | 3:1 | 6,599 | 2 | 528 | 18 | 34 |
| 3&4 | 1:3 | 6,369 | 2 | 526 | 22 | 38 |
| | 1:1 | 6,410 | 2 | 530 | 25 | 45 |
| | 3:1 | 6,435 | 2 | 528 | 22 | 38 |
The simulation is carried out on a computer server. The computer server is equipped with a Xeon 2.66 GHz quad-core processor and 4GB of main memory. A CentOS 5.5 operating system is installed on the computer server. The computational time is collected from the processing of multiple independent data sets for each simulation setting. The displayed time is the maximum time required by each algorithm to analyse one data set.
Statistical power of 2LOmb to detect genetic heterogeneity in small-scaled simulated data sets with different numbers of available SNPs, different numbers of causative SNPs and different ratios of case samples from two affected groups
| Number of causative SNPs | Ratio of case samples | | | | |
|---|---|---|---|---|---|
| 2&2 | 1:3 | 1.00 | 0.95 | 1.00 | 0.55 |
| | 1:1 | 1.00 | 1.00 | 1.00 | 1.00 |
| 3&3 | 1:3 | 1.00 | 1.00 | 1.00 | 0.88 |
| | 1:1 | 1.00 | 1.00 | 1.00 | 1.00 |
| 4&4 | 1:3 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 1:1 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2&3 | 1:3 | 1.00 | 0.93 | 1.00 | 0.60 |
| | 1:1 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 3:1 | 1.00 | 0.98 | 1.00 | 0.94 |
| 2&4 | 1:3 | 1.00 | 0.94 | 1.00 | 0.63 |
| | 1:1 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 3:1 | 1.00 | 1.00 | 1.00 | 0.99 |
| 3&4 | 1:3 | 1.00 | 1.00 | 1.00 | 0.88 |
| | 1:1 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 3:1 | 1.00 | 1.00 | 1.00 | 0.97 |
Each data set consists of balanced case-control samples of size 1,600. The results indicate that 2LOmb detects at least one interaction in every data set (global p-value < 0.0001).
Statistical power of 2LOmb to detect genetic heterogeneity in large-scaled simulated data sets with different numbers of available SNPs, different sample sizes and different ratios of case samples from two affected groups where the affected status is governed by a two-locus interaction
| Sample size | Ratio of case samples | | | | |
|---|---|---|---|---|---|
| 1,600 | 1:3 | 1.00 | 0.30 | 1.00 | 0.13 |
| | 1:1 | 1.00 | 1.00 | 1.00 | 1.00 |
| 3,200 | 1:3 | 1.00 | 0.98 | 1.00 | 0.92 |
| | 1:1 | 1.00 | 1.00 | 1.00 | 1.00 |
| 6,400 | 1:3 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 1:1 | 1.00 | 1.00 | 1.00 | 1.00 |
The results indicate that 2LOmb detects at least one interaction in every data set (global p-value < 0.0001).
Computational time required by 2LOmb to analyse large-scaled simulated data sets with different numbers of available SNPs, different sample sizes and different ratios of case samples from two affected groups where the affected status is governed by a two-locus interaction
| Sample size | Ratio of case samples | | |
|---|---|---|---|
| 1,600 | 1:3 | 34 | 3,106 |
| | 1:1 | 34 | 3,116 |
| 3,200 | 1:3 | 68 | 6,227 |
| | 1:1 | 68 | 6,256 |
| 6,400 | 1:3 | 135 | 12,503 |
| | 1:1 | 136 | 12,560 |
The simulation is carried out on a computer system with a graphics processing unit. The parallelism of the graphics processing unit is exploited to speed up the computation. The computer system is equipped with an AMD 2.8 GHz quad-core processor, 4GB of main memory and an NVIDIA GeForce GTX 285 graphics processing unit. The graphics processing unit contains 240 streaming processors sharing 1GB of GDDR3 memory. Each streaming processor has a clock rate of 1.48 GHz. An Ubuntu 9.10 operating system is installed on the computer system. The computational time is collected from the processing of multiple independent data sets for each simulation setting. The displayed time is the maximum time required to analyse one data set.
2LOmb identifies 12 SNPs, which are located within or near five genes, from the reduced T1D data set
| Gene | Cytoband | SNP | rs ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MUC21 | 6p21.32 | 1 | rs2844678 | | ● | ● | | | | ● | ● | | | | | ● | | | | | | | |
| | | 2 | rs2523929 | | | | ● | ● | | | | ● | ● | | | | ● | | | | | | |
| | | 3 | rs2530699 | | | | | | ● | | | | | ● | ● | | | ● | | | | | |
| MUC22 | 6p21.33 | 4 | rs9262546 | ● | | | | | | | | | | | | | | | ● | ● | ● | | |
| | | 5 | rs6933349 | ● | ● | | ● | | ● | | | | | | | | | | | | | | |
| | | 6 | rs4713423 | | | ● | | ● | | | | | | | | | | | | | | | |
| PSORS1C1 | 6p21.3 | 7 | rs9263715 | | | | | | | ● | | ● | | ● | | | | | ● | | | | |
| | | 8 | rs9263716 | | | | | | | | ● | | ● | | ● | | | | | ● | | | |
| TCF19 | 6p21.3 | 9 | rs9263794 | | | | | | | | | | | | | ● | ● | ● | | | ● | | |
| ATAD1 | 10q23.31 | 10 | rs12775041 | | | | | | | | | | | | | | | | | | | ● | |
| | | 11 | rs12573160 | | | | | | | | | | | | | | | | | | | ● | ● |
| | | 12 | rs12781171 | | | | | | | | | | | | | | | | | | | | ● |
Twenty SNP pairs are present in the ensemble. A pair of dots in the same column denotes a SNP pair.
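One way to see how the 12 SNPs split into two independent sets is to treat each SNP pair as an edge in a graph and look at its connected components. The sketch below does this for the 20 pairs read column by column from the table above, using the SNP numbers 1 to 12. Treating "independent sets" as connected components is an assumption made for illustration, and the names `PAIRS` and `connected_components` are hypothetical.

```python
from collections import defaultdict

# The 20 SNP pairs, one per table column, written with SNP numbers 1-12.
PAIRS = [
    (4, 5), (1, 5), (1, 6), (2, 5), (2, 6),
    (3, 5), (1, 7), (1, 8), (2, 7), (2, 8),
    (3, 7), (3, 8), (1, 9), (2, 9), (3, 9),
    (4, 7), (4, 8), (4, 9), (10, 11), (11, 12),
]


def connected_components(pairs):
    """Group SNPs that are linked, directly or indirectly, by at least one pair."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    visited = set()
    components = []
    for start in sorted(neighbours):
        if start in visited:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal from the start SNP
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        visited |= component
        components.append(sorted(component))
    return components


print(connected_components(PAIRS))
```

Under this reading, SNPs 1 to 9 (the chromosome 6 SNPs) form one component and SNPs 10 to 12 (the chromosome 10 SNPs) form the other, since no pair links the two groups.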
Figure 3. LD patterns of SNPs within or near MUC21, MUC22, PSORS1C1, TCF19 and ATAD1. LD is explained by D′ displayed in the upper triangle and r² displayed in the lower triangle. Dark colours indicate high values while pale colours indicate low values. Distances between SNPs are given in terms of the number of base pairs. SNP1 = rs2844678, SNP2 = rs2523929, SNP3 = rs2530699, SNP4 = rs9262546, SNP5 = rs6933349, SNP6 = rs4713423, SNP7 = rs9263715, SNP8 = rs9263716, SNP9 = rs9263794, SNP10 = rs12775041, SNP11 = rs12573160 and SNP12 = rs12781171.
Two-locus penetrances that lead to the heritability of 0.01
| | BB | Bb | bb |
|---|---|---|---|
| AA | 0 | 0 | 4 |
| Aa | 0 | 2 | 0 |
| aa | 4 | 0 | 0 |

AA and BB denote homozygous wild-type genotypes. Aa and Bb denote heterozygous genotypes. aa and bb denote homozygous variant genotypes. All allele frequencies are equal to 0.5. K = 1/201.
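The stated heritability can be checked with a few lines of exact arithmetic. The sketch below assumes that the table entries are multiples of K, that K is the disease prevalence, that genotypes follow Hardy-Weinberg proportions (1/4, 1/2, 1/4 at allele frequency 0.5), and that heritability is the variance of the penetrance divided by K(1 − K). None of these assumptions is stated explicitly in the table; the row and column order (AA, Aa, aa by BB, Bb, bb) is also assumed.

```python
from fractions import Fraction
from itertools import product

K = Fraction(1, 201)  # assumed to be the disease prevalence

# Table entries, read as multiples of K (assumption); rows AA, Aa, aa; columns BB, Bb, bb.
MULTIPLIERS = [
    [0, 0, 4],
    [0, 2, 0],
    [4, 0, 0],
]

# Hardy-Weinberg genotype frequencies at allele frequency 0.5.
GENOTYPE_FREQ = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]


def penetrance(i, j):
    """Probability of disease given genotype i at locus A and j at locus B."""
    return MULTIPLIERS[i][j] * K


def prevalence():
    """Population-wide probability of disease."""
    return sum(
        GENOTYPE_FREQ[i] * GENOTYPE_FREQ[j] * penetrance(i, j)
        for i, j in product(range(3), repeat=2)
    )


def marginal_penetrances():
    """Single-locus penetrances; equal values at a locus mean no marginal effect."""
    locus_a = [sum(GENOTYPE_FREQ[j] * penetrance(i, j) for j in range(3)) for i in range(3)]
    locus_b = [sum(GENOTYPE_FREQ[i] * penetrance(i, j) for i in range(3)) for j in range(3)]
    return locus_a, locus_b


def heritability():
    """Variance of the penetrance divided by K(1 - K)."""
    k = prevalence()
    variance = sum(
        GENOTYPE_FREQ[i] * GENOTYPE_FREQ[j] * (penetrance(i, j) - k) ** 2
        for i, j in product(range(3), repeat=2)
    )
    return variance / (k * (1 - k))


print(prevalence(), marginal_penetrances(), heritability())
```

Under these assumptions every single-locus penetrance equals K, which is what makes the interaction purely epistatic, and the heritability evaluates to exactly 1/100.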
Three-locus penetrances that lead to the heritability of 0.01
| | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 |
| | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 |
| | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |

AA, BB and CC denote homozygous wild-type genotypes. Aa, Bb and Cc denote heterozygous genotypes. aa, bb and cc denote homozygous variant genotypes. All allele frequencies are equal to 0.5. K = 1/901.
Four-locus penetrances that lead to the heritability of 0.01
| | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | | 0 | 0 | 64 | 0 | 0 | 0 | 0 | 0 | 0 |
| | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 |
| | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | | 0 | 0 | 0 | 0 | 0 | 0 | 64 | 0 | 0 |
| | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

AA, BB, CC and DD denote homozygous wild-type genotypes. Aa, Bb, Cc and Dd denote heterozygous genotypes. aa, bb, cc and dd denote homozygous variant genotypes. All allele frequencies are equal to 0.5. K = 1/3501.
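A short calculation suggests why a method built on two-locus analyses can still pick up this four-locus model. The sketch below stores only the three non-zero cells and computes marginal penetrances for any subset of loci. It rests on several assumptions that the flattened table does not confirm: entries are multiples of K, K is the prevalence, rows index the genotypes at loci A and B while columns index loci C and D (each in the order wild-type, heterozygous, variant), and genotypes follow Hardy-Weinberg proportions. The names `NONZERO`, `marginal` and `heritability` are hypothetical.

```python
from fractions import Fraction
from itertools import product

K = Fraction(1, 3501)  # assumed to be the disease prevalence
N_LOCI = 4
GENOTYPE_FREQ = (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4))

# Non-zero cells as multiples of K; genotype codes 0, 1, 2 for loci (A, B, C, D).
# The mapping from table position to genotype is an assumption.
NONZERO = {(0, 2, 0, 2): 64, (1, 1, 1, 1): 8, (2, 0, 2, 0): 64}


def penetrance(genotype):
    """Probability of disease for a full four-locus genotype."""
    return NONZERO.get(genotype, 0) * K


def genotype_freq(genotype):
    """Joint frequency of a four-locus genotype for independent loci."""
    freq = Fraction(1)
    for code in genotype:
        freq *= GENOTYPE_FREQ[code]
    return freq


def marginal(loci, values):
    """Probability of disease given `values` at `loci`, averaged over the other loci."""
    numerator = Fraction(0)
    denominator = Fraction(0)
    for genotype in product(range(3), repeat=N_LOCI):
        if all(genotype[locus] == value for locus, value in zip(loci, values)):
            weight = genotype_freq(genotype)
            numerator += weight * penetrance(genotype)
            denominator += weight
    return numerator / denominator


def heritability():
    """Variance of the penetrance divided by K(1 - K)."""
    variance = sum(
        genotype_freq(g) * (penetrance(g) - K) ** 2
        for g in product(range(3), repeat=N_LOCI)
    )
    return variance / (K * (1 - K))


# Two-locus marginal table for loci A and B, in units of K.
for a in range(3):
    print([marginal((0, 1), (a, b)) / K for b in range(3)])
```

Under these assumptions each single-locus marginal equals K, so no locus shows an effect on its own, while the two-locus marginal for loci A and B reproduces the 0, 2, 4 pattern of the two-locus table. The pairwise signal is therefore visible to a two-locus test even though the full model involves four loci.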