Neural networks for modeling gene-gene interactions in association studies.

Frauke Günther¹, Nina Wawro, Karin Bammann.

Abstract

BACKGROUND: Our aim is to investigate the ability of neural networks to model different two-locus disease models. We conduct a simulation study to compare neural networks with two standard methods, namely logistic regression models and multifactor dimensionality reduction. One hundred data sets are generated for each of six two-locus disease models, which are considered in a low and in a high risk scenario. Two models represent independence, one is a multiplicative model, and three models are epistatic. For each data set, six neural networks (with up to five hidden neurons) and five logistic regression models (the null model, three main effect models, and the full model) with two different codings for the genotype information are fitted. Additionally, the multifactor dimensionality reduction approach is applied.
RESULTS: The results show that neural networks are more successful in modeling the structure of the underlying disease model than logistic regression models in most of the investigated situations. In our simulation study, neither logistic regression nor multifactor dimensionality reduction is able to correctly identify biological interaction.
CONCLUSIONS: Neural networks are a promising tool to handle complex data situations. However, further research is necessary concerning the interpretation of their parameters.

Year: 2009 PMID: 20030838 PMCID: PMC2817696 DOI: 10.1186/1471-2156-10-87

Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797

Background

The investigation of complex diseases plays an important role in genetic epidemiology, where the identification of genetic risk factors is of great interest. Besides the study of main effects, the interplay of two or more genetic risk factors is gaining increasing attention. The identification of such a biological interaction or epistasis, however, poses new challenges for statistical methods. A major problem is the discrepancy between statistical and biological interaction. Statistical interaction is commonly defined as the deviation from an additive effect of single risk factors on the outcome or on the transformed outcome. In logistic regression models, for example, a multiplicative structural model is applied and an additive effect on the logit-transformed outcome implies a multiplicative effect on the untransformed outcome. Therefore, statistical interaction in a logistic regression model is understood as deviation from a multiplicative effect. In contrast, biological interaction is present if one gene influences the effect of another [1]. The two concepts do not coincide, as was shown, for example, by North et al. [2] and Foraita et al. [3]. Nevertheless, a meaningful interpretation of genetic studies requires the detection of biological interaction with statistical methods (cf. [4,5]). A variety of parametric and non-parametric methods have been proposed for modeling and detecting gene-gene interaction, e.g. support-vector machines [6], random forests [7,8], multi-factor dimensionality reduction (MDR, [9,10]), combinatorial partitioning methods [11], focused interaction testing framework [12], classification and regression trees (CART, [13]), logic regression [14], and lasso regression [15]. A useful classification is given by Musani et al. [16], who distinguish between regression-based methods, data reduction-based methods, and pattern recognition methods in their overview.
Despite the wealth of these approaches, none of the proposed methods is optimal for all two-locus disease models (see e.g. [17-19]). Consequently, there is no established method for analyzing gene-gene interactions so far [20]. Since parametric methods have difficulty detecting interaction in the absence of main effects and non-parametric approaches are ineffective when main effects are present [16,21], it might well be that there is no single approach appropriate for all types of biological interaction. Currently, generalized linear models, and here logistic regression models, as well as MDR are predominantly applied (see e.g. [22-27]). Another tool that has been employed in genetic epidemiology during the last 15 years is the neural network approach (see e.g. [28-32]). Neural networks are a flexible statistical tool to model any functional relationship between covariates and response variables. Therefore, they represent a promising approach to deal with the difficulties associated with modeling biological gene-gene interactions. They have also been applied successfully to variable selection, for example with genetic programming neural networks (GPNN, [33-36]) or grammatical evolution neural networks (GENN, [37,38]). Both approaches were developed to identify an optimal network topology. Motsinger et al. [39] successfully applied GENN to simulated genome-wide association data with 500,000 single nucleotide polymorphisms (SNPs), showing the general ability of neural networks to handle such large data sets. However, variable selection is not the focus of this paper. The aim of this paper is to explore the ability of neural networks to model different types of biological gene-gene interactions. For this purpose, a simulation study is conducted to investigate the behavior of neural networks in various situations. We assume a case-control study with equal numbers of cases and controls.
Following the scenarios of Risch [40] and the concept of epistatic models as classified by Li and Reich [41], different theoretical types of gene-gene interactions are studied. There are exactly two loci involved, i.e. variable selection is not a problem. The results are compared with those of logistic regression models and those of MDR analyses. Finally, the advantages and disadvantages of using a neural network approach are discussed.

Methods

Neural networks

A feed-forward multilayer perceptron (MLP) is chosen as neural network [42]. The general idea of an MLP is to approximate arbitrary functional relationships between covariates and response variables. The underlying structure of an MLP is a weighted, directed graph, whose vertices are called neurons and whose edges are called synapses. The neurons are organized in layers and each layer is fully connected by synapses to the next layer. The input layer contains all considered covariates and the output layer the response variables. An arbitrary number of so-called hidden layers can be included between the input and the output layer. See Figure 1 for an example of a neural network with one hidden layer.

Figure 1

Neural network. Neural network with one hidden layer consisting of three hidden neurons.

Data pass through the neural network as signals. These signals travel along the synapses and pass through the neurons, where they are processed. All incoming signals are added and the activation function σ is applied to the resulting sum. Additionally, a weight is attached to each of the synapses. A positive weight indicates an amplifying effect, a negative weight a repressing effect on the signal. During the training process, the weights are modified by a learning algorithm. The learning algorithm minimizes an error function that depends on the difference between the given output and the output estimated by the neural network. In general, the strength of the modification depends on a specified learning rate. The minimal MLP without hidden layer is equivalent to the generalized linear model [43] and computes the function

$$o(x) = \sigma\Big(w_0 + \sum_{i=1}^{n} w_i x_i\Big),$$

where w = (w_0, w_1, ..., w_n) denotes the weight vector including the intercept w_0, x = (x_1, ..., x_n) the input vector, and σ the activation function. Any arbitrary function can be chosen as activation function, although most learning algorithms require a differentiable activation function. Choosing the inverse of the link function used for the logistic regression model, σ(z) = 1/(1 + exp(-z)), the MLP without hidden layer is algebraically equivalent to the logistic regression model and computes

$$o(x) = \frac{1}{1 + \exp\Big(-\big(w_0 + \sum_{i=1}^{n} w_i x_i\big)\Big)}.$$

In this case, all weights w of the MLP correspond to the regression coefficients β of the logistic regression model. Hidden layers can be included to increase the modeling flexibility. An MLP with one hidden layer consisting of J hidden neurons computes the function

$$o(x) = \sigma\Big(w_0 + \sum_{j=1}^{J} w_j \, \sigma\Big(w_{0j} + \sum_{i=1}^{n} w_{ij} x_i\Big)\Big),$$

where w_{0j} and w_{ij} denote the intercept and the input weights of the jth hidden neuron, and is capable of modeling any piecewise continuous function [44]. For such networks, however, the parameters lack a direct interpretation. In the present paper, we investigate MLPs with at most one hidden layer. Resilient backpropagation [45] and cross entropy are chosen as learning algorithm and error function, respectively.
The latter choice guarantees equivalence of the trained weights to maximum-likelihood estimation (see e.g. [46]). The employment of resilient backpropagation as learning algorithm does not require a transformation of continuous data. It solves the problem of choosing an appropriate learning rate for each data situation.
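The equivalence between the MLP without hidden layer and the logistic regression model can be checked numerically. The following is a minimal sketch in plain Python, not the neuralnet implementation used in the study; the function names and the weight layout (intercept first) are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation, i.e. the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_no_hidden(w, x):
    """MLP without hidden layer: sigma(w0 + sum_i w_i * x_i); w[0] is the intercept."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def mlp_one_hidden(w_out, w_hidden, x):
    """MLP with one hidden layer.

    w_hidden holds one weight vector per hidden neuron (intercept first);
    w_out holds the output intercept followed by one weight per hidden neuron.
    """
    hidden = [mlp_no_hidden(w_j, x) for w_j in w_hidden]
    return mlp_no_hidden(w_out, hidden)

def logistic_regression(beta, x):
    """Logistic regression written in its exp(eta) / (1 + exp(eta)) form."""
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical weights and genotypes (number of mutated alleles at loci A and B).
weights = [-1.0, 0.5, 0.25]
genotypes = [2, 1]
p_mlp = mlp_no_hidden(weights, genotypes)
p_logit = logistic_regression(weights, genotypes)
```

With identical weights and coefficients, `p_mlp` and `p_logit` agree up to floating-point error, which is the algebraic equivalence stated above.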

Design of the simulation study

We conduct a simulation study, where neural network models are used to fit different two-locus disease models in a case-control design. For each of these models, one low risk and one high risk scenario is simulated. Unconditional logistic regression models are fitted to the same data sets to compare the results with an established method. For judging the ability to model the underlying disease model, the estimated penetrance matrices are compared to the theoretical penetrance matrices.

Two-locus disease models

Six different two-locus disease models are considered: three models introduced by Risch [40] and three different epistatic models. They can be distinguished by the structure of their penetrance matrices f = [f_ij], where i, j ∈ {0, 1, 2} represent the genotypes at the two loci.

1. The first two-locus disease model is Risch's additivity model (ADD). Here, the penetrance matrix is given by summing the so-called penetrance terms a_i and b_j:

$$f_{ij} = P(Y = 1 \mid G_A = i, G_B = j) = a_i + b_j,$$

where Y denotes the case-control status and G_A and G_B, with G_A, G_B ∈ {0, 1, 2}, the genotypes at the two involved loci. The penetrance terms a_i and b_j are restricted to 0 ≤ a_i, b_j ≤ 1 and a_i + b_j ≤ 1. This model represents biological independence of both loci.

2. For Risch's heterogeneity model (HET), the penetrance matrix is also determined by the penetrance terms:

$$f_{ij} = a_i + b_j - a_i b_j.$$

Like the additivity model, the heterogeneity model describes a model of biological independence for 0 ≤ a_i, b_j ≤ 1. However, in this case no further constraints on the penetrance terms are necessary.

3. The third setting is Risch's multiplicative model (MULT). The penetrance matrix is given by the penetrance terms as follows:

$$f_{ij} = a_i \cdot b_j.$$

The multiplicative model represents biological interaction.

4. In the first epistatic model (EPI RR), the penetrance matrix is given by a matrix of the following type:

$$f = \begin{pmatrix} c & c & c \\ c & c & c \\ c & c & r \cdot c \end{pmatrix},$$

where the constant term c denotes the baseline risk of getting the disease and r the risk increase or decrease. This model assumes that both genes have a recessive effect on the disease, since there is only an increased or decreased risk if both loci carry two mutated alleles.

5. In the second epistatic model (EPI DD), both loci are assumed to be dominant. The baseline risk c applies whenever at least one locus carries no mutated allele; an increased or decreased risk, governed by the factors r_1 and r_2, is only observed if both loci carry at least one mutated allele.

6. The last considered scenario is a mixed epistatic model (EPI RD). In this situation, one gene (A) has a recessive and one gene (B) has a dominant effect on the disease. Its penetrance matrix is again built from the baseline risk c and the factors r_1 and r_2.

All epistatic models represent gene-gene interaction. By choosing the parameters r, r1, r2 and the ratios a1/a0, a2/a0, b1/b0, and b2/b0, respectively, different risk scenarios can be generated.
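The penetrance structures of the three Risch models and of the recessive-recessive epistatic model can be written down directly. The sketch below is illustrative only: the baseline values `a0`, `b0` and `c` are hypothetical placeholders, not the values used in the study, and the EPI DD and EPI RD matrices are omitted because their exact layout is not reproduced here.

```python
def risch_matrix(model, a, b):
    """3x3 penetrance matrix f[i][j] for genotype i at locus A and j at locus B."""
    if model == "ADD":
        cell = lambda ai, bj: ai + bj
    elif model == "HET":
        cell = lambda ai, bj: ai + bj - ai * bj
    elif model == "MULT":
        cell = lambda ai, bj: ai * bj
    else:
        raise ValueError("unknown model: " + model)
    return [[cell(a[i], b[j]) for j in range(3)] for i in range(3)]

def epi_rr_matrix(c, r):
    """Baseline risk c everywhere, except r*c when both loci carry two mutated alleles."""
    f = [[c] * 3 for _ in range(3)]
    f[2][2] = r * c
    return f

# Low risk scenario ratios (a1 = 2*a0, a2 = 4*a0, b1 = 5*b0, b2 = 10*b0);
# the baselines a0 = b0 = 0.001 and c = 0.01 are assumptions for illustration.
a0, b0 = 0.001, 0.001
a = [a0, 2 * a0, 4 * a0]
b = [b0, 5 * b0, 10 * b0]

f_add = risch_matrix("ADD", a, b)
f_het = risch_matrix("HET", a, b)
f_mult = risch_matrix("MULT", a, b)
f_epi_rr = epi_rr_matrix(c=0.01, r=5)
```

For small penetrance terms the ADD and HET matrices are nearly identical, because the product term a_i·b_j is negligible compared with a_i + b_j.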

Data generation

The data generation follows a two-step procedure. As a first step, basic populations with one million observations are simulated. For the six two-locus disease models introduced above we investigate two risk scenarios each (see Table 1). This results in 12 basic populations with two biallelic loci, A and B. The genetic information is drawn randomly with a minor allele frequency for both loci of 0.3 to ensure sufficient cell frequencies in the final case-control samples. Both loci are assumed to be in linkage equilibrium and it is assumed that the Hardy-Weinberg equilibrium holds. The case-control status is drawn according to probabilities of a given penetrance matrix in relation to the respective disease model and the risk scenario. In all 12 settings, parameters are chosen such that the overall disease prevalence is equal to 0.01. The genotype information is described by a codominant coding, i.e. the genotype at each locus represents the number of mutated alleles.
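The first simulation step can be sketched as follows. This is a simplified illustration, not the authors' R code: the rescaling helper assumes a penetrance pattern that is proportional to a single baseline risk (as in EPI RR), the population size is reduced to 10,000 for speed, and all function names are hypothetical.

```python
import random

def hwe_genotype_probs(maf):
    """P(0, 1, 2 mutated alleles) under Hardy-Weinberg equilibrium."""
    p = 1.0 - maf
    return [p * p, 2.0 * p * maf, maf * maf]

def prevalence(f, maf_a=0.3, maf_b=0.3):
    """Overall disease prevalence for penetrance matrix f under linkage equilibrium."""
    pa, pb = hwe_genotype_probs(maf_a), hwe_genotype_probs(maf_b)
    return sum(pa[i] * pb[j] * f[i][j] for i in range(3) for j in range(3))

def scale_to_prevalence(shape, target=0.01):
    """Rescale a relative-risk pattern so that the prevalence equals the target."""
    k = target / prevalence(shape)
    return [[k * v for v in row] for row in shape]

def draw_individual(f, rng, maf=0.3):
    """Draw (genotype A, genotype B, case status) for one individual."""
    probs = hwe_genotype_probs(maf)
    g_a = rng.choices([0, 1, 2], weights=probs)[0]
    g_b = rng.choices([0, 1, 2], weights=probs)[0]
    y = 1 if rng.random() < f[g_a][g_b] else 0
    return g_a, g_b, y

# EPI RR, low risk scenario (r = 5), expressed relative to the baseline risk.
epi_rr_shape = [[1, 1, 1], [1, 1, 1], [1, 1, 5]]
f = scale_to_prevalence(epi_rr_shape)

rng = random.Random(1)  # fixed seed, chosen arbitrarily
population = [draw_individual(f, rng) for _ in range(10000)]
```

With a minor allele frequency of 0.3, the genotype probabilities at each locus are 0.49, 0.42 and 0.09, so the rarest genotype combination (two mutated alleles at both loci) has probability 0.0081 in the basic population.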

Table 1

Risk scenarios.

Two-locus disease model	Low risk scenario	High risk scenario
ADD, HET, MULT	a₁= 2·a₀	a₁= 5·a₀
	a₂= 4·a₀	a₂= 10·a₀
	b₁= 5·b₀	b₁= 5·b₀
	b₂= 10·b₀	b₂= 10·b₀

EPI RR	r = 5	r = 10

EPI DD, EPI RD	r₁= 2	r₁= 5
	r₂= 4	r₂= 10

Applied risk scenarios for all two-locus disease models.

As a second step, 100 case-control samples with 1,000 cases and 1,000 controls are drawn randomly from each basic population, i.e. each combination of two-locus disease model and risk scenario. Overall, this results in 12 times 100 case-control samples that will be analyzed.

Modeling the data

Model-building with neural networks is done using six different network topologies from zero neurons in the hidden layer (i.e. no hidden layer) up to five neurons in the hidden layer. Each topology is trained five times with synaptic weights initialized with random numbers drawn from a standard normal distribution to avoid local minima. From these fitted models, the best model for each data set, i.e. the network topology, is chosen using Akaike's Information Criterion (AIC, [47]). The following five logistic regression models are fitted to each data set: the null model (NM), three main effect models (only locus A (SiA), only locus B (SiB), both main effects (ME)), and a full model including both main effects and an interaction term (FM). The best model for each data set is chosen based on the AIC. Note that the neural network with zero neurons in the hidden layer is algebraically equivalent to the main effect model ME. In a second approach, logistic regression models are fitted to the data with two dichotomous design variables representing each locus. Instead of counting the number of mutated alleles, these two variables reflect the heterozygous genotype and the homozygous genotype with two mutated alleles, respectively. For instance, the main effect model for locus A only (SiA) is modeled with a codominant coding as

$$\operatorname{logit}\big(P(Y_k = 1)\big) = \beta_0 + \beta_1 G_{A,k}$$

as opposed to

$$\operatorname{logit}\big(P(Y_k = 1)\big) = \beta_0 + \beta_1 \mathbf{1}(G_{A,k} = 1) + \beta_2 \mathbf{1}(G_{A,k} = 2)$$

with design variables. The observation is indexed by k, β represents the regression coefficients and 1(·) an indicator function. Table 2 gives an overview of the fitted statistical models and the numbers of needed parameters for all considered models.
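The difference between the two genotype codings, and the AIC used for model choice, can be illustrated with a short sketch. The helper names and the coefficient values are hypothetical; the AIC is written in its standard form, 2·(number of parameters) − 2·(log-likelihood).

```python
def codominant(g):
    """Codominant coding: the number of mutated alleles as a single covariate."""
    return [g]

def design_variables(g):
    """Two dichotomous design variables: heterozygous, homozygous mutated."""
    return [1 if g == 1 else 0, 1 if g == 2 else 0]

def linear_predictor(beta, covariates):
    """Logit of the case probability; beta[0] is the intercept."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], covariates))

def aic(log_likelihood, n_params):
    """Akaike's Information Criterion; the model with the smallest value is chosen."""
    return 2 * n_params - 2 * log_likelihood

# SiA model for genotype G_A = 2 under both codings (coefficients are made up).
eta_codominant = linear_predictor([-0.5, 0.4], codominant(2))            # 2 parameters
eta_design = linear_predictor([-0.5, 0.3, 0.9], design_variables(2))     # 3 parameters
```

The codominant coding forces the effect of two mutated alleles on the logit scale to be twice that of one, whereas the design variables estimate both effects freely at the cost of one extra parameter per locus.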

Table 2

Number of parameters.

	Neural network
0 hidden neurons	3
1 hidden neuron	5
2 hidden neurons	9
3 hidden neurons	13
4 hidden neurons	17
5 hidden neurons	21

	Logistic regression	Logistic regression (DV)

Null model (NM)	1	1
One main effect (SiA/SiB)	2	3
Both main effects (ME)	3	5
Full model (FM)	4	9

Number of parameters for neural networks, logistic regression models and logistic regression models with design variables (DV).
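The parameter counts in Table 2 follow from simple bookkeeping, which can be sketched as below. The function names are hypothetical; the sketch assumes two input covariates for the neural network (one per locus) and either one (codominant) or two (design variables) covariates per locus for logistic regression.

```python
def n_params_mlp(n_hidden, n_inputs=2):
    """Weights of an MLP with at most one hidden layer, intercepts included."""
    if n_hidden == 0:
        return n_inputs + 1
    # each hidden neuron: n_inputs weights + intercept; output: n_hidden weights + intercept
    return n_hidden * (n_inputs + 1) + (n_hidden + 1)

def n_params_logistic(model, covariates_per_locus=1):
    """Coefficients of the logistic regression models NM, SiA/SiB, ME and FM."""
    d = covariates_per_locus
    counts = {
        "NM": 1,
        "SiA": 1 + d,
        "SiB": 1 + d,
        "ME": 1 + 2 * d,
        "FM": 1 + 2 * d + d * d,  # interaction: one term per pair of covariates
    }
    return counts[model]

mlp_counts = [n_params_mlp(h) for h in range(6)]
```

Note that the full model with design variables has nine parameters, one per genotype-genotype combination, so it is saturated for the 3 × 3 penetrance matrix.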

These three applied statistical methods deliver as output an estimate of the probability of being a case, i.e. the penetrance for each genotype-genotype combination. We compare these estimated penetrance matrices to the theoretical ones to judge the ability of the statistical methods to model the underlying two-locus disease model. A penetrance matrix derived from a case-control sample differs considerably from one derived from the basic population, since the penetrance matrix depends on the prevalence of disease in the considered data. Therefore, we have to compute the theoretical penetrance matrix for the case-control sample using the penetrance matrix from the basic population, the allele frequencies and the prevalence of the population (see appendix for an example). The comparison of the obtained theoretical penetrance matrix with the penetrance matrices estimated by the three different statistical approaches gives results which are independent of sampling error, since the theoretical penetrance matrix represents a perfectly drawn case-control sample. For each of the 12 populations, the mean absolute difference between theoretical and estimated penetrance matrix is calculated element by element for each genotype-genotype combination over the n = 100 case-control samples:

$$E_{ij} = \frac{1}{n} \sum_{k=1}^{n} \Big| f_{ij} - \hat{f}_{ij}^{(k)} \Big|,$$

where i, j ∈ {0, 1, 2}, and f_ij and \hat{f}_{ij}^{(k)} denote the entries of the theoretical penetrance matrix and of the penetrance matrix estimated from the kth sample, respectively. Furthermore, the sum of the mean absolute differences, ∑E = $\sum_{i,j} E_{ij}$, is considered. The data generation and the statistical analyses for neural network and logistic regression are performed using R [48]. The package for the MLP, neuralnet, was newly implemented by our group and is published on CRAN [49]. Additionally, the MDR approach is applied to the data.
The analyses are conducted with the Java-based open source software MDR release 1.2.5 with default configurations [50]. In particular, analysis configurations are specified as follows: the random seed is set to zero, the attribute count maximum is set to two and the cross-validation count to ten. The MDR identifies a set of functional variables that is best for classifying cases and controls. Due to the number of simulated loci, the software can only select one of three sets: either locus A or locus B only or both loci. Additionally, it provides a dendrogram to distinguish between redundant and synergistic variables based on information theory [51].
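The comparison between theoretical and estimated penetrance matrices described above can be sketched as follows. The appendix example is not reproduced here, so the conversion to a case-control penetrance matrix is an assumption: it applies Bayes' theorem for an idealized sample with a case fraction of 0.5, with genotype frequencies from Hardy-Weinberg and linkage equilibrium. Function names are hypothetical.

```python
def hwe_genotype_probs(maf):
    """P(0, 1, 2 mutated alleles) under Hardy-Weinberg equilibrium."""
    p = 1.0 - maf
    return [p * p, 2.0 * p * maf, maf * maf]

def case_control_penetrance(f, maf_a=0.3, maf_b=0.3, case_fraction=0.5):
    """P(case | genotypes) in an idealized case-control sample."""
    pa, pb = hwe_genotype_probs(maf_a), hwe_genotype_probs(maf_b)
    k = sum(pa[i] * pb[j] * f[i][j] for i in range(3) for j in range(3))  # prevalence
    result = []
    for i in range(3):
        row = []
        for j in range(3):
            p_g_case = pa[i] * pb[j] * f[i][j] / k                    # P(genotypes | case)
            p_g_control = pa[i] * pb[j] * (1.0 - f[i][j]) / (1.0 - k)  # P(genotypes | control)
            numerator = case_fraction * p_g_case
            row.append(numerator / (numerator + (1.0 - case_fraction) * p_g_control))
        result.append(row)
    return result

def mean_abs_difference(theoretical, estimates):
    """Element-wise mean absolute difference E[i][j] over all estimated matrices."""
    n = len(estimates)
    return [[sum(abs(theoretical[i][j] - est[i][j]) for est in estimates) / n
             for j in range(3)] for i in range(3)]

def total_difference(e):
    """Sum of the mean absolute differences over all nine cells."""
    return sum(sum(row) for row in e)
```

A flat population penetrance matrix maps to 0.5 in every cell of the case-control matrix, which shows why penetrances in the sample are on a very different scale from those in a population with a prevalence of 0.01.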

Results

In a first step, we investigate the ability of neural networks and logistic regression models to model different two-locus disease models. Table 3 shows the results for Risch's additivity model. Here, the sum of the mean absolute differences between estimated penetrance and theoretical penetrance matrix is lowest for the neural networks. This is most pronounced in the high risk scenario (∑E = 0.2059 for neural networks versus ∑E = 0.2544 and ∑E = 0.2804 for logistic regression models without and with design variables). Logistic regression models with design variables have in general higher deviations than those without design variables. These results are also reflected in the element-wise comparison of the estimated matrices. For each of the risk scenarios, the neural network estimates five out of nine penetrances with the highest accuracy, i.e. with smallest difference to the theoretical penetrance, compared to the logistic regression models. The heterogeneity model yields virtually the same results as the additivity model (results not shown).

Table 3

Additive model (ADD).

	Low risk	High risk
	a₁= 2·a₀; a₂= 4·a₀	a₁= 5·a₀; a₂= 10·a₀
	b₁= 5·b₀; b₂= 10·b₀	b₁= 5·b₀; b₂= 10·b₀
Theoretical penetrance matrix

Neural network
Mean absolute difference E
Sum	0.2313	0.2059

Logistic regression
Mean absolute difference E
Sum	0.2530	0.2544

Logistic regression (design variables)
Mean absolute difference E
Sum	0.2897	0.2804