Burcu Darst, Corinne D Engelman, Ye Tian, Justo Lorenzo Bermejo.
Abstract
BACKGROUND: Multiple layers of genetic and epigenetic variability are being simultaneously explored in an increasing number of health studies. We summarize here different approaches applied in the Data Mining and Machine Learning group at the GAW20 to integrate genome-wide genotype and methylation array data.
Keywords: Data mining; Epigenome-wide association study; Genome-wide association study; Machine learning
Year: 2018 PMID: 30255774 PMCID: PMC6157271 DOI: 10.1186/s12863-018-0646-3
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Fig. 1 Mind-map with the 5 contributions from the Data Mining and Machine Learning group. The figure shows the main approach used by each group member. For example, Islam et al. used support vector machines in addition to deep learning, and Kapusta et al. applied random forest based on results from cluster analysis. EWAS, epigenome-wide association study; GWAS, genome-wide association study
Investigated data in contributions from the Data Mining and Machine Learning group at the GAW20
| Contribution | Sample size | Real data | Simulated data | GWAS | EWAS | Investigated phenotype(s) |
|---|---|---|---|---|---|---|
| Random forest (Darst) | 680 | X | X | X | log average post-TG − log average pre-TG | |
| Deep learning (Islam) | 993/499a | X | X | pre-TG, post-TG | ||
| Cluster analysis (Kapusta) | 446 | X | X | X | relative TG difference, metabolic syndrome | |
| Mixed models (Datta) | 680 | X | X | X | post TG-pre TG | |
| Gene-set enrichment (Piette) | 680 | X | X | X | log average post TG/log average pre TG |
EWAS, epigenome-wide association study; GWAS, genome-wide association study; TG, triglyceride concentration
a There were 993 participants in the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study, but posttreatment methylation data were available for only 499
Fig. 2 Recursive feature elimination–random forest applied to combined genome-wide genotype and methylation data. Recursive feature elimination was applied to random forest (RF) and consisted of the following steps: (a) running the random forest model; (b) removing features that random forest ranked in the bottom 3%; (c) ranking removed features starting with the lowest rank; and (d) recursively iterating until no additional features could be removed from the model. The comparison between random forest and random forest with recursive feature elimination relied on the full set of ranks
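The elimination loop in steps (a) to (d) can be sketched as follows. This is a minimal illustration and not the contributors' code: the function name `rfe_ranking`, the `importance_fn` callback, and the toy importance scores are all hypothetical. In practice `importance_fn` would refit a random forest on the remaining features and return its variable importances; here a fixed lookup table stands in for that step so the sketch runs with the standard library alone. The stopping rule (stop once 3% of the remaining features rounds down to zero) is one possible reading of "until no additional features could be removed".

```python
from typing import Callable, Dict, List, Sequence


def rfe_ranking(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    drop_fraction: float = 0.03,
) -> Dict[str, int]:
    """Rank features by recursive elimination (rank 1 = most important)."""
    remaining = list(features)
    worst_first: List[str] = []  # features in order of elimination
    while True:
        # (a) score the features still in the model
        scores = importance_fn(remaining)
        ordered = sorted(remaining, key=lambda f: scores[f])  # least important first
        # (b) number of features in the bottom fraction
        n_drop = int(drop_fraction * len(ordered))
        if n_drop == 0:
            # (d) nothing more can be removed: survivors keep their final order
            worst_first.extend(ordered)
            break
        # (c) removed features receive the lowest ranks still available
        worst_first.extend(ordered[:n_drop])
        remaining = ordered[n_drop:]
    best_first = worst_first[::-1]
    return {name: rank for rank, name in enumerate(best_first, start=1)}


# Toy stand-in for a random forest: feature "f<i>" has fixed importance i (assumption).
toy_scores = {f"f{i}": float(i) for i in range(100)}


def toy_importance(names: List[str]) -> Dict[str, float]:
    return {name: toy_scores[name] for name in names}


ranks = rfe_ranking(list(toy_scores), toy_importance)
```

Because every feature receives a rank, including those eliminated early, the output is a full ranking that can be compared against the ranks from a single random forest run.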
Fig. 3 Deep learning model applied to genome-wide methylation data. Panel a represents an interconnected node (neuron), the basic element of artificial neural networks: a denotes the n input signals into the neuron, w denotes the corresponding weights, and b is a random bias added to avoid overfitting. The weighted sum of the inputs plus the bias, z, is transformed into an output value by a fixed activation function σ. Panel b shows the specific deep neural network model used to investigate GAW20 methylation data. The first layer (input layer) included all 463,995 CpG sites. The second and third layers (hidden layers) were configured with 500 and 250 nodes, respectively. The fourth layer (ReLU) introduces nonlinearity, and the fifth layer (Dropout) is intended to counter overfitting
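The neuron in panel a and the layer order in panel b can be illustrated with a small forward pass. This is a sketch under stated assumptions rather than the model used in the contribution: the layer sizes are shrunk from 463,995/500/250 to 6/4/2 so the example runs instantly, the weights are random, ReLU stands in for the activation σ of the single neuron, and the rescaling of kept units in `dropout` (inverted dropout) is an assumed convention. All function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def relu(z: float) -> float:
    """Rectified linear unit: passes positive values, zeroes the rest."""
    return max(0.0, z)


def identity(z: float) -> float:
    return z


def neuron(inputs: Sequence[float], weights: Sequence[float], bias: float,
           activation: Callable[[float], float] = relu) -> float:
    """Compute sigma(z) with z = sum_i w_i * a_i + b."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return activation(z)


def dense(inputs: Sequence[float], weight_rows: Sequence[Sequence[float]],
          biases: Sequence[float],
          activation: Callable[[float], float]) -> List[float]:
    """A fully connected layer: one neuron per row of weights."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


def dropout(values: Sequence[float], rate: float, rng: random.Random) -> List[float]:
    """Zero each value with probability `rate`; rescale the kept ones (assumed convention)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
x = [rng.random() for _ in range(6)]                      # toy stand-in for CpG inputs
w1 = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(4)]
w2 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]

h1 = dense(x, w1, [0.0] * 4, identity)                    # first hidden layer
h2 = dense(h1, w2, [0.0] * 2, identity)                   # second hidden layer
out = dropout([relu(v) for v in h2], rate=0.5, rng=rng)   # ReLU layer, then Dropout
```

Dropout is active only during training; at prediction time the layer would normally pass its inputs through unchanged.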
Applied filters, possibility of adjustment for familial correlation, and strengths and limitations of applied methods in the contributions from the Data Mining and Machine Learning group at the GAW20
| Contribution | Applied filters | Potential correlation adjustment | Strengths | Limitations |
|---|---|---|---|---|
| Random forest (Darst) | None | Yes | Model free; adequate for high-dimensional data | Does not work well with highly correlated variables |
| Deep learning (Islam) | Methylation variability | Yes | Robust; adequate for high-dimensional data | Results are difficult to interpret; parameter setup is difficult; large sample sizes are needed |
| Cluster analysis (Kapusta) | Reported genome-wide association studies on metabolic syndrome and fenofibrate treatment, principal component analysis, random forest | Yes | Intuitive cluster interpretation | Prior dimension reduction may be needed |
| Mixed models (Datta) | Mixed models modification | Yes | Simple regression framework | Not indicated for low-dimensional data |
| Gene-set enrichment (Piette) | t-tests and linear regression | No | Circumvents multiple testing | Requires biological insight |