Marylyn D Ritchie, Alison A Motsinger-Reif.
Abstract
During the past two decades, the field of human genetics has experienced an information explosion. The completion of the Human Genome Project and the development of high-throughput SNP technologies have created a wealth of data; however, the analysis and interpretation of these data have created a research bottleneck. While technology facilitates the measurement of hundreds or thousands of genes, statistical and computational methodologies are lacking for the analysis of these data. New statistical methods and variable selection strategies must be explored for identifying disease susceptibility genes for common, complex diseases. Neural networks (NN) are a class of pattern recognition methods that have been successfully implemented for data mining and prediction in a variety of fields. The application of NN for statistical genetics studies is an active area of research. Neural networks have been applied in both linkage and association analysis for the identification of disease susceptibility genes.
In the current review, we consider how NN have been used for both linkage and association analyses in genetic epidemiology. We discuss both the successes of these initial NN applications and the questions that arose during the previous studies. Finally, we introduce evolutionary computing strategies, Genetic Programming Neural Networks (GPNN) and Grammatical Evolution Neural Networks (GENN), for using NN in association studies of complex human diseases that address some of the caveats illuminated by previous work.
Year: 2008 PMID: 18822147 PMCID: PMC2553772 DOI: 10.1186/1756-0381-1-3
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Glossary of Statistical Genetics Terms
| Allele | One member of a series of different forms of a gene |
| Association study | The use of case-control, cohort, or even family data to statistically relate genetic variations to a disease/phenotype |
| Chromosome | A singular, physical piece of DNA, which can contain many genes and regulatory elements |
| Epistasis | Gene-gene interaction; a deviation from additivity in the effect of alleles at different loci with respect to their contribution to a phenotype |
| Gene | A heritable unit; a region of genomic sequence which is associated with regulatory, transcribed, and/or other functional regions |
| Genotype | Specific allele combinations for an individual |
| Genotyping | The experimental determination of sequence variations |
| Linkage study | The use of genotype and phenotype information from multiple biologically-related family members to determine whether a chromosomal region is preferentially inherited by offspring with the trait of interest |
| Locus | A fixed position on a chromosome |
| Mendelian disease | A genetic disease that is caused by a single locus, and displays a pattern of inheritance in line with Mendel's Laws |
| Phenotype | A measurable trait for an individual |
| Pedigree | Multiple biologically-related individuals with known familial relationships |
| Single Nucleotide Polymorphism (SNP) | A DNA sequence variation; the smallest unit of variation in the genome |
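The glossary entry for epistasis ("a deviation from additivity") can be made concrete with a small worked example. The sketch below uses a hypothetical two-locus penetrance table and assumed genotype frequencies (0.25, 0.5, 0.25 at each locus); neither comes from the review. Under these assumptions each locus shows no marginal effect on its own, yet disease risk clearly depends on the genotype combination.

```python
# Hypothetical penetrance table: P(disease | genotype at locus A, genotype at locus B).
# These values are illustrative assumptions, not data from the review.
PENETRANCE = {
    ("AA", "BB"): 0.0, ("AA", "Bb"): 0.1, ("AA", "bb"): 0.0,
    ("Aa", "BB"): 0.1, ("Aa", "Bb"): 0.0, ("Aa", "bb"): 0.1,
    ("aa", "BB"): 0.0, ("aa", "Bb"): 0.1, ("aa", "bb"): 0.0,
}

# Assumed genotype frequencies (allele frequency 0.5 at both loci).
FREQ_A = {"AA": 0.25, "Aa": 0.5, "aa": 0.25}
FREQ_B = {"BB": 0.25, "Bb": 0.5, "bb": 0.25}


def marginal_penetrance_a(genotype_a):
    """Average risk for one genotype at locus A, summed over locus B."""
    return sum(FREQ_B[b] * PENETRANCE[(genotype_a, b)] for b in FREQ_B)


def marginal_penetrance_b(genotype_b):
    """Average risk for one genotype at locus B, summed over locus A."""
    return sum(FREQ_A[a] * PENETRANCE[(a, genotype_b)] for a in FREQ_A)


marginals_a = {g: marginal_penetrance_a(g) for g in FREQ_A}
marginals_b = {g: marginal_penetrance_b(g) for g in FREQ_B}
print(marginals_a)
print(marginals_b)
```

Because every marginal risk is identical, any model that adds up single-locus effects predicts the same risk for all nine genotype combinations, while the table itself varies between 0.0 and 0.1. That gap is the deviation from additivity.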
Figure 1. A typical feed-forward NN. A feed-forward neural network with one input layer consisting of eight nodes (X_i), two hidden layers with four and two nodes, respectively (Σ), and one output layer (O). The connections between layers have associated connection strengths, or weights (a_i).
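The architecture in Figure 1 (eight inputs, hidden layers of four and two nodes, one output) can be sketched as a forward pass in plain Python. The sigmoid activation, the bias term on each node, the random weight range, and the example input vector are all assumptions made for illustration; the caption specifies only the layer sizes and the presence of weights.

```python
import math
import random


def sigmoid(z):
    """Logistic activation (an assumed choice; the figure does not name one)."""
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """One weight list per node; the last entry is an assumed bias term."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layers, x):
    """Propagate an input vector through every layer in turn."""
    activations = list(x)
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(node[:-1], activations)) + node[-1])
            for node in layer
        ]
    return activations


rng = random.Random(0)
sizes = [8, 4, 2, 1]  # input, hidden 1, hidden 2, output (as in Figure 1)
network = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

# Hypothetical input: eight genotypes coded 0/1/2.
output = forward(network, [0, 1, 2, 0, 1, 1, 2, 0])[0]
print(output)
```

The weights here are random and untrained, so the output only demonstrates the data flow, not a meaningful prediction.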
Summary of NN Studies Reviewed
| Publication | Input type | Input coding | Output type | Output coding | Number of hidden layers | Number of hidden nodes |
| Bhat et al. 1999 | Binary | 0 = absence of allele; 1 = presence of allele | Binary | 1/0/0 = unaffected; 0/1/0 = mildly affected; 0/0/1 = severely affected | 1 | 15 |
| Bush et al. 2005 | Discrete | -1, -1 = 1/1 genotype; 0, +2 = 1/2 genotype; +1, -1 = 2/2 genotype | Binary | 0 = unaffected; 1 = affected | GP evolved | |
| Costello et al. 2003 | Discrete | Varied | Binary | 0 = unaffected; 1 = affected | Multiple variations | |
| Curtis et al. 2001 | Discrete | 0 = AA genotype; 1 = AB genotype; 2 = BB genotype | Binary | 0 = unaffected; 1 = affected | 2 | 3 |
| Curtis 2007 | Discrete | 0 = AA genotype; 1 = AB genotype; 2 = BB genotype | Binary | 0 = unaffected; 1 = affected | 2 | 3 |
| Giachino et al. 2007 | Discrete and continuous | Categorical values of genotypes and clinical features | Binary | 0 = unaffected; 1 = affected | 1 | unknown |
| Li et al. 1999 | Discrete | IBD sharing: +1 = shared allele; -1 = unshared allele; 0 = uninformative | Binary | 0/1 = concordant or not; 0/1 = affected or unaffected | Multiple variations | |
| Lin et al. 2006 | Discrete | Categories of genotype combinations | Binary | 0 = non-response; 1 = response | Multiple variations | |
| Lucek and Ott 1997 | Binary | 0 = absence of allele; 1 = presence of allele | Binary | 4 nodes for each trait (20 total nodes); 0 = quantitative trait off; 1 = quantitative trait on | 1 | 70 |
| Lucek et al. 1998 | Discrete | IBD sharing: +1 = shared allele; -1 = unshared allele; 0 = uninformative | Binary | +1, +1 = target output; 0, +1 = noise | 1 | √220 |
| Marinov and Weeks 2001 | Discrete | IBD sharing: +1 = shared allele; -1 = unshared allele; 0 = uninformative | Binary | +1, +1 = target output; 0, +1 = noise | 1 | √220 |
| Matchenko-Shimko and Dube 2006 | Discrete | Three combinations of possible allele combinations, transformed to a 0–1 range | Binary | 0 = control; 1 = case | Multiple variations | |
| Motsinger et al. (2006a) | Discrete | -1, -1 = 1/1 genotype; 0, +2 = 1/2 genotype; +1, -1 = 2/2 genotype | Binary | 0 = unaffected; 1 = affected | GP evolved | |
| Motsinger et al. (2006b) | Discrete | -1, -1 = 1/1 genotype; 0, +2 = 1/2 genotype; +1, -1 = 2/2 genotype | Binary | 0 = unaffected; 1 = affected | GE evolved | |
| North et al. 2003 | Discrete | 0 = AA genotype; 1 = AB genotype; 2 = BB genotype | Binary | 0 = unaffected; 1 = affected | Multiple variations | |
| Ott 2001 | Discrete | -1, -1 = 1/1 genotype; 0, +2 = 1/2 genotype; +1, -1 = 2/2 genotype | Binary | 0 = unaffected; 1 = affected | NA | |
| Pankratz et al. 2001 | Discrete | IBD sharing: +1 = shared allele; -1 = unshared allele; 0 = uninformative | Binary | 1/1 = affected/affected; 0/1 = affected/unaffected | 1 | 4 |
| Penco et al. 2005 | Discrete | Categories of allele combinations at each genotype | Binary | 0 = unaffected; 1 = affected | Multiple variations, including an evolutionary process | |
| Pociot et al. 2004 | Discrete | Number of categories per sliding window | Binary | 0 = unaffected; 1 = affected | Multiple variations | |
| Ritchie et al. 2003 | Discrete | -1, -1 = 1/1 genotype; 0, +2 = 1/2 genotype; +1, -1 = 2/2 genotype | Binary | 0 = unaffected; 1 = affected | GP evolved | |
| Saccone et al. 1999 | Discrete | IBD sharing: +1 = shared allele; -1 = unshared allele; 0 = uninformative | Binary | 1/1 = affected/affected; 0/1 = affected/unaffected | 18 variations | |
| Serretti and Smeraldi 2004 | Discrete | SERPR*l/l = 1; SERPR*l/s = 2; SERPR*s/s = 2; TPH*C/C = 1; TPH*C/A = 2; TPH*A/A = 2 | Binary | 0 = nonresponse; 1 = response | 1 | 7 |
| Shoemaker et al. 2001 | Varied | Varied | Binary | 0 = unaffected; 1 = affected | 1 | unknown |
| Tomita et al. 2004 | Discrete | Homozygous for major allele = (0.1, 0.1); Heterozygous = (0.1, 0.9); Homozygous for minor allele = (0.9, 0.9) | Binary | 0.9 = case; 0.1 = control | Multiple variations | |
| Zandi et al. 2001 | Continuous | Pedigree-specific NPL scores | Binary | 1,1 = case pedigree; 1,0 = control pedigree | 15 variations | |
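Three of the genotype input codings listed in the table above can be compared side by side with a short sketch. The lookup tables reproduce the codings reported for Ott 2001 and the GPNN/GENN studies (two dummy variables per locus), for Curtis et al. 2001 and North et al. 2003 (a single 0/1/2 value), and for Tomita et al. 2004 (paired 0.1/0.9 values). The dictionary names, the `encode` helper, and the choice to key genotypes by the number of copies of allele 2 (or B, or the minor allele) are assumptions made for illustration.

```python
# Genotypes are keyed by copies of allele 2 / allele B / the minor allele
# (0, 1, or 2). This keying is an illustrative convention, not from the review.
OTT_DUMMY = {0: (-1, -1), 1: (0, 2), 2: (1, -1)}          # 1/1, 1/2, 2/2
ALLELE_COUNT = {0: (0,), 1: (1,), 2: (2,)}                # AA, AB, BB
TOMITA = {0: (0.1, 0.1), 1: (0.1, 0.9), 2: (0.9, 0.9)}    # major hom., het., minor hom.


def encode(genotypes, scheme):
    """Flatten per-locus codes into one NN input vector."""
    inputs = []
    for genotype in genotypes:
        inputs.extend(scheme[genotype])
    return inputs


sample = [0, 1, 2]  # hypothetical individual typed at three loci
print(encode(sample, OTT_DUMMY))
print(encode(sample, ALLELE_COUNT))
print(encode(sample, TOMITA))
```

The dummy and paired codings use two input nodes per locus, so the same three loci produce six inputs rather than three.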
Figure 2. Overview of the GPNN method (adapted from Ritchie et al. 2003). First, GPNN has a set of parameters to be initialized before beginning the evolution of NN models. Second, the data are divided into 10 equal parts for 10-fold cross-validation. Third, training begins by generating an initial population of random solutions. Fourth, each NN is evaluated on the training set and its fitness (classification error) is recorded. Fifth, the best solutions are selected for crossover and reproduction using a fitness-proportionate selection technique. The new generation begins the cycle again. This continues until a stopping criterion (classification error of zero or a limit on the number of generations) is met. At the end of the GPNN evolution, the overall best solution is selected as the optimal NN. Sixth, this best GPNN model is tested on the 1/10 of the data left out to estimate the prediction error of the model. Steps two through six are performed ten times with the same parameter settings, each time using a different 9/10 of the data for training and 1/10 of the data for testing. The loci that are consistently present in the GPNN models are selected as the functional loci and are used as input to a final GPNN evolutionary process to estimate the classification and prediction error of the GPNN model.
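Two of the steps in Figure 2, the 10-fold cross-validation split and fitness-proportionate selection, can be sketched in a few lines. This is not the GPNN implementation: the function names are hypothetical, the conversion of classification error to a selection weight (1 minus error, plus a small constant so weights stay positive) is an assumption, and the crossover and NN evaluation steps are omitted.

```python
import random


def ten_fold_splits(n_samples, rng, k=10):
    """Yield (train, test) index lists; each fold is held out exactly once."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


def fitness_proportionate_select(population, errors, n_parents, rng):
    """Roulette-wheel selection: lower classification error, higher chance."""
    # Assumed mapping from error to selection weight; epsilon avoids all-zero weights.
    weights = [(1.0 - error) + 1e-9 for error in errors]
    return rng.choices(population, weights=weights, k=n_parents)


rng = random.Random(1)
splits = list(ten_fold_splits(100, rng))

# Hypothetical population of three candidate models with their training errors.
parents = fitness_proportionate_select(["nn_a", "nn_b", "nn_c"], [0.45, 0.30, 0.10], 4, rng)
print(len(splits), parents)
```

Selection is done with replacement, so a low-error model can be chosen as a parent more than once in the same generation.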
Figure 3. A binary expression tree representation of a NN. This is an example of one NN optimized by GPNN. The O is the output node, Σ indicates the activation function, a_i indicates a weight, and X1-X8 are the NN inputs. The C nodes are constants.
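A binary expression tree like the one in Figure 3 can be evaluated recursively. The sketch below is an illustration under stated assumptions rather than the GPNN encoding itself: the tuple-based node layout, the node labels, the sigmoid activation, and the example tree are all hypothetical. Each weight node multiplies a weight value (built from constants) by an incoming signal, and each activation node sums its two children before applying the activation.

```python
import math


def evaluate(node, x):
    """Recursively evaluate a tuple-encoded expression tree on inputs x."""
    kind = node[0]
    if kind == "x":        # input node X_i, 1-based index
        return x[node[1] - 1]
    if kind == "c":        # constant node C
        return node[1]
    if kind == "op":       # arithmetic on two subtrees, used to build weight values
        _, symbol, left, right = node
        a, b = evaluate(left, x), evaluate(right, x)
        return {"+": a + b, "-": a - b, "*": a * b}[symbol]
    if kind == "w":        # weight node a_i: weight value times incoming signal
        _, weight, signal = node
        return evaluate(weight, x) * evaluate(signal, x)
    if kind == "sigma":    # activation node: sum two children, then squash
        _, left, right = node
        total = evaluate(left, x) + evaluate(right, x)
        return 1.0 / (1.0 + math.exp(-total))
    raise ValueError(f"unknown node type: {kind!r}")


# Hypothetical tree: output = sigmoid(2.0 * X1 + (1.0 - 3.0) * X2)
tree = (
    "sigma",
    ("w", ("c", 2.0), ("x", 1)),
    ("w", ("op", "-", ("c", 1.0), ("c", 3.0)), ("x", 2)),
)

print(evaluate(tree, [1, 1]))
print(evaluate(tree, [1, 0]))
```

Because an activation node can itself appear as the signal child of a weight node, the same recursion handles deeper trees with hidden layers.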