Alison A Motsinger-Reif, Theresa J Fanelli, Anna C Davis, Marylyn D Ritchie.
Abstract
BACKGROUND: Increasingly efficient means of obtaining genetic information have produced a surge of data, creating a need for analysis methods beyond traditional parametric statistical approaches. We recently introduced the Grammatical Evolution Neural Network (GENN), a machine-learning approach for detecting gene-gene or gene-environment interactions, also known as epistasis, in high-dimensional genetic epidemiological data. GENN has been highly successful across a range of simulated data, but the impact of the kinds of error common in real data is unknown. In the current study, we examine the power of GENN to detect interactions of interest in the presence of noise due to genotyping error, missing data, phenocopy, and genetic heterogeneity. Additionally, we compare the performance of GENN to that of another computational method, Multifactor Dimensionality Reduction (MDR).
Year: 2008 PMID: 18710518 PMCID: PMC2531119 DOI: 10.1186/1756-0500-1-65
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. An overview of the GENN method. First, a set of parameters is initialized in the configuration file; these parameters specify details of the evolutionary process. Second, the data are divided into 10 equal parts for 10-fold cross-validation. Third, training begins by generating an initial population of random solutions using sensible initialization, which guarantees functional NNs in the initial population. Fourth, each newly generated NN is evaluated on the data in the training set and its fitness is recorded. Fifth, a user-specified selection technique is used to select the best solutions for crossover and reproduction. The cycle then begins again with the new generation, which is equal in size to the original population, and continues until either a classification error of zero is found or a limit on the number of generations is reached. After each generation, an optimal solution is identified. At the end of GENN evolution, the overall best solution is selected as the optimal NN. Sixth, this best GENN model is tested on the 1/10 of the data left out to estimate the prediction error of the model. Steps two through six are performed ten times with the same parameter settings, each time using a different 9/10 of the data for training and 1/10 of the data for testing. At the end of a GENN analysis, 10 models have been generated: one best model from each cross-validation interval. A final model is chosen by maximizing the cross-validation consistency of variables/loci across the ten models.
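The cross-validation bookkeeping described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the GENN implementation: the function names `ten_fold_splits` and `rank_loci_by_consistency` are hypothetical, the neural-network evolution itself is omitted, and the per-fold best models in `example_models` are made-up values. It assumes that consistency is counted per locus, as the number of cross-validation intervals whose best model contains that locus.

```python
import random
from collections import Counter

def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists; every sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def rank_loci_by_consistency(fold_models):
    """Count how many per-fold best models contain each locus, most consistent first."""
    counts = Counter()
    for loci in fold_models:
        counts.update(set(loci))  # a locus counts at most once per fold
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical best models (tuples of locus numbers) from ten CV intervals.
example_models = [(5, 10)] * 7 + [(5, 10, 3), (3, 4), (5,)]
ranking = rank_loci_by_consistency(example_models)
```

In this made-up example, loci 5 and 10 appear in far more intervals than loci 3 and 4, so a consistency-based rule would favor a final model built on loci 5 and 10.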
Figure 2. Penetrance functions used to simulate epistasis models.
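As a rough illustration of how a two-locus penetrance function drives a simulation, the sketch below draws genotypes and assigns disease status with the probability given by the penetrance table. The table values, the allele frequency of 0.5, the Hardy-Weinberg sampling, and the names `PENETRANCE`, `draw_genotype`, and `simulate_case_control` are all assumptions made for the example; they are not taken from Figure 2. Loci are indexed from 0 in the code.

```python
import random

# Illustrative penetrance table: P(affected | genotype at locus A, genotype at locus B).
# Rows and columns are genotypes coded 0, 1, 2. Values are hypothetical.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

def draw_genotype(rng, p):
    """Number of copies (0, 1, or 2) of an allele with frequency p."""
    return int(rng.random() < p) + int(rng.random() < p)

def simulate_case_control(n_cases, n_controls, n_loci, functional, p=0.5, seed=0):
    """Sample individuals until the requested numbers of cases and controls are filled."""
    rng = random.Random(seed)
    a, b = functional  # indices of the two interacting loci
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        genotypes = [draw_genotype(rng, p) for _ in range(n_loci)]
        affected = rng.random() < PENETRANCE[genotypes[a]][genotypes[b]]
        if affected and len(cases) < n_cases:
            cases.append(genotypes)
        elif not affected and len(controls) < n_controls:
            controls.append(genotypes)
    return cases, controls

cases, controls = simulate_case_control(50, 50, n_loci=10, functional=(4, 9))
```

With this table, neither locus has an effect on its own when p = 0.5; the disease risk depends only on the combination of the two genotypes, which is the kind of purely epistatic signal the methods are asked to recover.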
Power (%) of MDR (from Ritchie et al. 2003) to detect correct functional epistatic loci.
| Source of Noise | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| None | 100 | 100 | 99 | 99 | 82 | 84 |
| GE | 100 | 100 | 100 | 97 | 80 | 92 |
| GH | 3 | 41 | 2 | 3 | 4 | 4 |
| PC | 90 | 99 | 45 | 32 | 30 | 32 |
| MS | 100 | 100 | 99 | 97 | 82 | 87 |
| GE+GH | 4 | 41 | 2 | 3 | 4 | 6 |
| GE+PC | 94 | 99 | 41 | 48 | 28 | 33 |
| GE+MS | 100 | 100 | 98 | 98 | 74 | 84 |
| GH+PC | 0 | 1 | 0 | 0 | 0 | 0 |
| GH+MS | 5 | 38 | 0 | 2 | 4 | 6 |
| PC+MS | 96 | 99 | 42 | 43 | 14 | 16 |
| GE+GH+PC | 1 | 1 | 0 | 0 | 0 | 0 |
| GE+GH+MS | 6 | 34 | 2 | 1 | 3 | 7 |
| GH+PC+MS | 0 | 0 | 0 | 0 | 0 | 0 |
| GE+PC+MS | 94 | 100 | 48 | 42 | 18 | 16 |
| GE+GH+PC+MS | 0 | 1 | 0 | 1 | 0 | 0 |
GE = 5% Genotyping Error; GH = 50% Genetic Heterogeneity; PC = 50% Phenocopy; MS = 5% Missing Data
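Two of the noise sources in the legend, genotyping error and missing data, can be mimicked on a genotype matrix as in the sketch below. The error model is an assumption: the excerpt gives only the 5% rates, not how errors were generated, so here an erroneous genotype is simply replaced by a different one chosen at random, and a missing genotype is stored as `None`. The function names are hypothetical. Phenocopy and genetic heterogeneity act on how disease status is generated and are not covered here.

```python
import random

GENOTYPES = (0, 1, 2)

def add_genotyping_error(data, rate, rng):
    """Replace each genotype, with probability `rate`, by a different genotype."""
    noisy = []
    for row in data:
        new_row = []
        for g in row:
            if rng.random() < rate:
                g = rng.choice([x for x in GENOTYPES if x != g])
            new_row.append(g)
        noisy.append(new_row)
    return noisy

def add_missing_data(data, rate, rng):
    """Set each genotype to None (missing) with probability `rate`."""
    return [[None if rng.random() < rate else g for g in row] for row in data]

rng = random.Random(0)
# Hypothetical clean data set: 200 individuals, 10 loci.
clean = [[rng.choice(GENOTYPES) for _ in range(10)] for _ in range(200)]
# GE+MS condition: 5% genotyping error followed by 5% missing data.
noisy = add_missing_data(add_genotyping_error(clean, 0.05, rng), 0.05, rng)
```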
Power (%) of GENN to detect correct functional epistatic loci.
| Source of Noise | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| None | 100 | 100 | 96 | 91 | 69 | 72 |
| GE | 100 | 100 | 96 | 85 | 58 | 68 |
| GH | 7 | 4 | 15 | 16 | 14 | 16 |
| PC | 88 | 92 | 21 | 12 | 17 | 21 |
| MS | 100 | 100 | 99 | 82 | 42 | 74 |
| GE+GH | 3 | 6 | 14 | 16 | 11 | 9 |
| GE+PC | 92 | 95 | 19 | 11 | 12 | 16 |
| GE+MS | 100 | 100 | 93 | 75 | 48 | 58 |
| GH+PC | 9 | 9 | 13 | 15 | 10 | 11 |
| GH+MS | 1 | 0 | 0 | 0 | 0 | 1 |
| PC+MS | 65 | 85 | 18 | 13 | 7 | 9 |
| GE+GH+PC | 3 | 9 | 2 | 2 | 7 | 3 |
| GE+GH+MS | 5 | 1 | 0 | 0 | 0 | 0 |
| GH+PC+MS | 0 | 0 | 0 | 0 | 0 | 0 |
| GE+PC+MS | 62 | 81 | 14 | 9 | 9 | 9 |
| GE+GH+PC+MS | 0 | 0 | 0 | 0 | 0 | 0 |
GE = 5% Genotyping Error; GH = 50% Genetic Heterogeneity; PC = 50% Phenocopy; MS = 5% Missing Data
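To compare the two methods row by row, the pipe-delimited rows of the MDR and GENN tables can be parsed and subtracted. The sketch below does this for the "None" and "GH" rows copied from the two tables above; `parse_row` and `power_difference` are hypothetical helper names, and a positive difference means GENN had higher power than MDR for that model.

```python
def parse_row(line):
    """Split a pipe-delimited table row into (label, list of power values)."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return cells[0], [int(c) for c in cells[1:]]

# Rows copied from the MDR and GENN tables for correct functional loci.
mdr_rows = [
    "| None | 100 | 100 | 99 | 99 | 82 | 84 |",
    "| GH | 3 | 41 | 2 | 3 | 4 | 4 |",
]
genn_rows = [
    "| None | 100 | 100 | 96 | 91 | 69 | 72 |",
    "| GH | 7 | 4 | 15 | 16 | 14 | 16 |",
]

def power_difference(genn_lines, mdr_lines):
    """Return {noise label: [GENN power - MDR power for Models 1..6]}."""
    mdr = dict(parse_row(line) for line in mdr_lines)
    diff = {}
    for line in genn_lines:
        label, genn = parse_row(line)
        diff[label] = [g - m for g, m in zip(genn, mdr[label])]
    return diff

difference = power_difference(genn_rows, mdr_rows)
```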
Power (%) of MDR (from Ritchie et al. 2007) to detect primary genetic model (5,10) in data with genetic heterogeneity.
| Source of Noise | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| GH | 30 | 18 | 19 | 25 | 8 | 8 |
| GE+GH | 39 | 18 | 19 | 25 | 8 | 8 |
| GH+PC | 11 | 18 | 5 | 3 | 4 | 2 |
| GH+MS | 28 | 23 | 19 | 19 | 9 | 13 |
| GE+GH+PC | 10 | 18 | 8 | 4 | 3 | 3 |
| GE+GH+MS | 29 | 22 | 21 | 20 | 7 | 4 |
| GH+PC+MS | 12 | 22 | 5 | 5 | 2 | 4 |
| GE+GH+PC+MS | 16 | 17 | 4 | 3 | 1 | 3 |
GE = 5% Genotyping Error; GH = 50% Genetic Heterogeneity; PC = 50% Phenocopy; MS = 5% Missing Data
Power (%) of GENN to detect primary genetic model (5,10) in data with genetic heterogeneity.
| Source of Noise | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| GH | 34 | 49 | 20 | 13 | 8 | 11 |
| GE+GH | 48 | 49 | 20 | 13 | 8 | 10 |
| GH+PC | 10 | 11 | 3 | 0 | 4 | 3 |
| GH+MS | 31 | 35 | 12 | 11 | 9 | 9 |
| GE+GH+PC | 11 | 8 | 3 | 0 | 1 | 2 |
| GE+GH+MS | 43 | 29 | 13 | 7 | 8 | 5 |
| GH+PC+MS | 3 | 8 | 3 | 1 | 5 | 3 |
| GE+GH+PC+MS | 4 | 8 | 1 | 3 | 0 | 2 |
GE = 5% Genotyping Error; GH = 50% Genetic Heterogeneity; PC = 50% Phenocopy; MS = 5% Missing Data
Power (%) of MDR (from Ritchie et al. 2007) to detect either genetic model (5,10 or 3,4) in data with genetic heterogeneity.
| Source of Noise | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| GH | 70 | 34 | 42 | 41 | 20 | 19 |
| GE+GH | 69 | 34 | 42 | 41 | 20 | 19 |
| GH+PC | 24 | 35 | 9 | 8 | 7 | 5 |
| GH+MS | 65 | 40 | 42 | 31 | 18 | 22 |
| GE+GH+PC | 27 | 35 | 10 | 8 | 7 | 6 |
| GE+GH+MS | 64 | 44 | 42 | 41 | 16 | 11 |
| GH+PC+MS | 23 | 38 | 9 | 10 | 4 | 6 |
| GE+GH+PC+MS | 31 | 36 | 9 | 7 | 4 | 3 |
GE = 5% Genotyping Error; GH = 50% Genetic Heterogeneity; PC = 50% Phenocopy; MS = 5% Missing Data
Power (%) of GENN to detect either genetic model (5,10 or 3,4) in data with genetic heterogeneity.
| Source of Noise | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| GH | 79 | 89 | 38 | 24 | 16 | 19 |
| GE+GH | 92 | 90 | 41 | 24 | 16 | 22 |
| GH+PC | 22 | 22 | 4 | 5 | 6 | 4 |
| GH+MS | 65 | 62 | 26 | 19 | 17 | 15 |
| GE+GH+PC | 23 | 23 | 4 | 3 | 2 | 4 |
| GE+GH+MS | 83 | 61 | 23 | 15 | 17 | 10 |
| GH+PC+MS | 13 | 17 | 3 | 2 | 7 | 4 |
| GE+GH+PC+MS | 7 | 16 | 2 | 4 | 1 | 3 |
GE = 5% Genotyping Error; GH = 50% Genetic Heterogeneity; PC = 50% Phenocopy; MS = 5% Missing Data
Power (%) of MDR to detect only correct loci (from either/both genetic models) in data with genetic heterogeneity.
| Source of Noise | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| GH | 71 | 36 | 46 | 45 | 26 | 21 |
| GE+GH | 71 | 36 | 46 | 45 | 26 | 21 |
| GH+PC | 29 | 39 | 11 | 8 | 10 | 7 |
| GH+MS | 68 | 41 | 48 | 36 | 21 | 29 |
| GE+GH+PC | 30 | 39 | 13 | 9 | 9 | 10 |
| GE+GH+MS | 65 | 46 | 45 | 45 | 20 | 16 |
| GH+PC+MS | 27 | 42 | 10 | 17 | 10 | 11 |
| GE+GH+PC+MS | 34 | 39 | 11 | 15 | 8 | 10 |
GE = 5% Genotyping Error; GH = 50% Genetic Heterogeneity; PC = 50% Phenocopy; MS = 5% Missing Data
Power (%) of GENN to detect only correct loci (from either/both genetic models) in data with genetic heterogeneity.
| Source of Noise | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| GH | 100 | 100 | 83 | 71 | 57 | 68 |
| GE+GH | 100 | 100 | 83 | 73 | 58 | 49 |
| GH+PC | 59 | 57 | 42 | 49 | 35 | 40 |
| GH+MS | 99 | 99 | 77 | 61 | 46 | 61 |
| GE+GH+PC | 63 | 81 | 39 | 40 | 23 | 32 |
| GE+GH+MS | 100 | 100 | 79 | 72 | 58 | 53 |
| GH+PC+MS | 57 | 68 | 41 | 50 | 31 | 31 |
| GE+GH+PC+MS | 50 | 74 | 37 | 42 | 31 | 34 |
GE = 5% Genotyping Error; GH = 50% Genetic Heterogeneity; PC = 50% Phenocopy; MS = 5% Missing Data
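The three power definitions used for the genetic heterogeneity tables can be expressed as simple set tests on the final model from each simulated data set. The sketch below is one possible reading of the table captions, not the authors' scoring code: it assumes "primary model" means the final model is exactly loci 5 and 10, "either model" means it is exactly loci 5 and 10 or exactly loci 3 and 4, and "only correct loci" means it is a non-empty subset of loci 3, 4, 5, and 10. The final models in `example_final_models` are made up.

```python
PRIMARY = frozenset({5, 10})
SECONDARY = frozenset({3, 4})

def detects_primary(model):
    """Final model is exactly the primary pair of loci (assumed reading)."""
    return model == PRIMARY

def detects_either(model):
    """Final model is exactly one of the two functional pairs (assumed reading)."""
    return model in (PRIMARY, SECONDARY)

def only_correct_loci(model):
    """Final model is non-empty and contains no non-functional locus (assumed reading)."""
    return bool(model) and model <= (PRIMARY | SECONDARY)

def power(final_models, criterion):
    """Percentage of data sets whose final model satisfies the criterion."""
    hits = sum(1 for m in final_models if criterion(frozenset(m)))
    return 100.0 * hits / len(final_models)

# Hypothetical final models from five simulated data sets.
example_final_models = [{5, 10}, {3, 4}, {5}, {5, 10, 7}, {1, 2}]
primary_power = power(example_final_models, detects_primary)
either_power = power(example_final_models, detects_either)
correct_only_power = power(example_final_models, only_correct_loci)
```

Under these assumed definitions, every model that counts toward "either model" also counts toward "only correct loci", which matches the pattern in the tables, where the "only correct loci" power is never lower than the "either model" power for the same condition.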