Jason H Moore, Ryan Amos, Jeff Kiralis, Peter C Andrews.
Abstract
Simulation plays an essential role in the development of new computational and statistical methods for the genetic analysis of complex traits. Most simulations start with a statistical model using methods such as linear or logistic regression that specify the relationship between genotype and phenotype. This is appealing due to its simplicity and because these statistical methods are commonly used in genetic analysis. It is our working hypothesis that simulations need to move beyond simple statistical models to more realistically represent the biological complexity of genetic architecture. The goal of the present study was to develop a prototype genotype-phenotype simulation method and software that are capable of simulating complex genetic effects within the context of a hierarchical biology-based framework. Specifically, our goal is to simulate multilocus epistasis or gene-gene interaction where the genetic variants are organized within the framework of one or more genes, their regulatory regions and other regulatory loci. We introduce here the Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) method and prototype software for simulating data in this manner. This approach combines a biological hierarchy, a flexible mathematical framework, a liability threshold model for defining disease endpoints, and a heuristic search strategy for identifying high-order epistatic models of disease susceptibility. We provide several simulation examples using genetic models exhibiting independent main effects and three-way epistatic effects.
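The combination of a mathematical framework with a liability threshold model can be illustrated with a minimal sketch. It is not the HIBACHI implementation: the function definitions behind the labels `ADD`, `MULT`, and `XOR`, the nesting order `outer(inner(TF1, TF2), Enhancer)`, the uniform genotype frequencies, and the names `liability`, `simulate`, and `case_fraction` are all assumptions made for illustration.

```python
import random

# Hypothetical two-argument functions. The labels match those used in the
# result tables, but the definitions here are assumed, not taken from HIBACHI.
FUNCTIONS = {
    "ADD": lambda a, b: a + b,
    "MULT": lambda a, b: a * b,
    "XOR": lambda a, b: a ^ b,
}


def liability(tf1, tf2, enhancer, f_inner, f_outer):
    """Combine three genotypes (coded 0, 1, 2) as outer(inner(tf1, tf2), enhancer)."""
    return FUNCTIONS[f_outer](FUNCTIONS[f_inner](tf1, tf2), enhancer)


def simulate(n, f_inner, f_outer, case_fraction=0.5, seed=0):
    """Return a list of ((tf1, tf2, enhancer), status) pairs, status 1 = case."""
    rng = random.Random(seed)
    # Assumption: each genotype is drawn uniformly from {0, 1, 2}.
    genotypes = [tuple(rng.choice((0, 1, 2)) for _ in range(3)) for _ in range(n)]
    scores = [liability(*g, f_inner, f_outer) for g in genotypes]
    # Liability threshold: individuals at or above this score are cases.
    # Tied scores mean the realized case fraction can exceed case_fraction.
    index = min(n - 1, int(n * (1 - case_fraction)))
    threshold = sorted(scores)[index]
    return [(g, int(s >= threshold)) for g, s in zip(genotypes, scores)]


data = simulate(1000, "XOR", "XOR")
```

Because liability takes only a few distinct values in this toy version, the threshold falls on a tie and the realized proportion of cases is approximate rather than exact.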
Keywords: bioinformatics; epistasis; simulation; software
Year: 2014 PMID: 25395175 PMCID: PMC4270828 DOI: 10.1002/gepi.21865
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.135
Figure 1. The left panels show screenshots of the biological and mathematical framework as well as the liability distribution for Models 1 (A) and 2 (B). The black notches on the right side of the liability distributions indicate the threshold for disease. To the right of each HIBACHI model is the MDR model showing the distribution of cases (dark bars) and controls (light bars) for each genotype combination. Dark-shaded genotype combinations indicate high risk of disease. Shown below each MDR model is the ViSEN network with main effects (circles), pairwise interactions (lines), and three-way interactions (triangles) highlighted in proportion to their effect size.
Figure 4. The left panels show screenshots of the biological and mathematical framework as well as the liability distribution for Models 7 (A) and 8 (B). The black notches on the right side of the liability distributions indicate the threshold for disease. To the right of each HIBACHI model is the MDR model showing the distribution of cases (dark bars) and controls (light bars) for each genotype combination. Dark-shaded genotype combinations indicate high risk of disease. Shown below each MDR model is the ViSEN network with main effects (circles), pairwise interactions (lines), and three-way interactions (triangles) highlighted in proportion to their effect size.
Performance measures shown include information gain (IG) and classification accuracy. The first P-value shown is derived from a standard permutation test, while the second comes from the explicit test of epistasis.
| Model | Functions | IG (TF1) | IG (TF2) | IG (Enhancer) | IG (TF1, TF2, Enhancer) | Accuracy | P (permutation test) | P (explicit test of epistasis) |
|---|---|---|---|---|---|---|---|---|
| 1 | ADD, ADD | 0.079 | 0.067 | 0.081 | 0.002 | 0.732 | <0.001 | 0.06 |
| 2 | ADD, MULT | 0.303 | 0.103 | 0.102 | 0 | 0.822 | <0.001 | 0.043 |
| 3 | MULT, MULT | 0.137 | 0.134 | 0.148 | 0.002 | 0.817 | <0.001 | <0.001 |
| 4 | XOR, XOR | 0.006 | 0.003 | 0.008 | 0.033 | 0.644 | <0.001 | <0.001 |
| 5 | MOD2, XOR | 0.002 | 0 | 0.001 | 0.046 | 0.621 | <0.001 | <0.001 |
| 6 | BITX, XOR | 0.003 | 0.001 | 0.001 | 0.04 | 0.644 | <0.001 | <0.001 |
| 7 | BITX, MOD2 | 0 | 0 | 0.002 | 0.058 | 0.636 | <0.001 | <0.001 |
| 8 | CHS, BITA | 0.146 | 0.002 | 0.001 | 0.16 | 0.7875 | <0.001 | <0.001 |
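As a rough illustration of the main-effect IG columns, information gain can be computed as the mutual information between a genotype attribute and case-control status. This is a minimal sketch: the function names `entropy` and `information_gain` are hypothetical, and the three-way IG column most likely reports an interaction-specific quantity whose exact definition is not given here, so this code does not attempt to reproduce it.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(attribute, status):
    """Mutual information I(A; C) = H(A) + H(C) - H(A, C)."""
    joint = list(zip(attribute, status))
    return entropy(attribute) + entropy(status) - entropy(joint)


# Toy XOR pattern: neither attribute is informative alone,
# but the pair determines the status completely.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
status = [0, 1, 1, 0]

ig_a = information_gain(a, status)                   # 0.0 bits
ig_ab = information_gain(list(zip(a, b)), status)    # 1.0 bit
```

The toy XOR pattern mirrors what Models 4 through 7 show in the table: near-zero gain for each attribute alone alongside a larger gain for the attributes taken together.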
Machine learning methods include classification trees (CT), k-nearest neighbors (kNN), naïve Bayes (NB), neural networks (NN), and support vector machines (SVM). Numbers shown are classification accuracies.
| Model | Functions | CT | kNN | NB | NN | SVM |
|---|---|---|---|---|---|---|
| 1 | ADD, ADD | 0.732 | 0.66 | 0.73 | 0.732 | 0.73 |
| 2 | ADD, MULT | 0.842 | 0.845 | 0.842 | 0.842 | 0.845 |
| 3 | MULT, MULT | 0.835 | 0.838 | 0.835 | 0.835 | 0.835 |
| 4 | XOR, XOR | 0.638 | 0.572 | 0.618 | 0.64 | 0.63 |
| 5 | MOD2, XOR | 0.625 | 0.575 | 0.488 | 0.618 | 0.59 |
| 6 | BITX, XOR | 0.625 | 0.59 | 0.555 | 0.625 | 0.568 |
| 7 | BITX, MOD2 | 0.635 | 0.548 | 0.508 | 0.585 | 0.558 |
| 8 | CHS, BITA | 0.788 | 0.76 | 0.658 | 0.788 | 0.778 |
Figure 5. Receiver operating characteristic (ROC) curves for each machine learning method applied to example data sets from Models 1 (A) and 7 (B). Note the performance diversity of the different methods for Model 7 where the genetic effects are mostly due to three-way interactions.
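For readers unfamiliar with ROC curves, the following sketch shows how one can be traced from classifier scores and true case-control labels, along with the area under the curve. It is a generic illustration rather than the procedure used to produce Figure 5; the names `roc_points` and `auc` are hypothetical, and the input must contain at least one case and one control.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    scores: higher means more likely to be a case; labels: 1 = case, 0 = control.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Handle tied scores as one step so ties give a single diagonal segment.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


perfect = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))   # 1.0
chance = auc(roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))    # 0.5
```

An area near 0.5 corresponds to a curve hugging the diagonal, and an area near 1.0 to a curve approaching the top-left corner.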