Lawrence B Holder, M Muksitul Haque, Michael K Skinner.
Abstract
Understanding epigenetic processes holds immense promise for medical applications. Advances in Machine Learning (ML) are critical to realize this promise. Previous studies used epigenetic data sets associated with the germline transmission of epigenetic transgenerational inheritance of disease and novel ML approaches to predict genome-wide locations of critical epimutations. A combination of Active Learning (ACL) and Imbalanced Class Learning (ICL) was used to address past problems with ML to develop a more efficient feature selection process and address the imbalance problem in all genomic data sets. The power of this novel ML approach and our ability to predict epigenetic phenomena and associated disease is suggested. The current approach requires extensive computation of features over the genome. A promising new approach is to introduce Deep Learning (DL) for the generation and simultaneous computation of novel genomic features tuned to the classification task. This approach can be used with any genomic or biological data set applied to medicine. The application of molecular epigenetic data in advanced machine learning analysis to medicine is the focus of this review.
Keywords: Active learning; DNA methylation; deep learning; epigenetics; epigenome; imbalanced-class learning; machine learning; molecular diagnostics
Year: 2017 PMID: 28524769 PMCID: PMC5687335 DOI: 10.1080/15592294.2017.1329068
Source DB: PubMed Journal: Epigenetics ISSN: 1559-2294 Impact factor: 4.528
Figure 1. Machine Learning approaches to epigenetic data analysis: #1 ACL-ICL on manually generated features; #2 ACL-ICL on DL-generated features; #3 solely DL-based classification. Modified from.
Machine learning approaches for biological data sets, along with their function, advantages, and disadvantages.
| Machine Learning Approach | Function | Advantages | Disadvantages |
|---|---|---|---|
| Supervised learning | Learn a model discriminating one class of biological phenomena from one or more other classes. | Precise model with predictive and interpretative properties. | Requires a similarly large number of examples from each class. |
| Unsupervised learning | Learn a model descriptive of the biological phenomena in the data. | Does not require class labels on data. | Sensitive to similarity measure; results difficult to interpret. |
| Semi-supervised learning | Learn a model from a mixture of labeled and unlabeled data. | Utilizes all available data; typically outperforms using only labeled data. | Sensitive to errors in propagating class labels from labeled to unlabeled data. |
| Feature selection | Reduce a large number of features to fewer, more informative features. | Improves efficiency and accuracy of learning. | Sensitive to feature evaluation metric; may discard informative features. |
| Active learning | Identify the most informative instances to label for accurate model learning. | Reduces number of examples needed to learn a model; reduces burden on human expert and experiment cost. | May focus learner on outliers rather than prominent classes. |
| Imbalanced class learning | Learn in the presence of a large skew in the number of examples of each class. | Learns with relatively few examples of the biological phenomenon of interest. | May underfit or overfit data depending on bias toward minority class. |
| Deep learning | Learn complex representations of concepts in the data. | General purpose and high accuracy. | Sensitive to parameter choices; long training times. |
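Two of the ideas in the table, imbalanced class learning and active learning, can be illustrated with a minimal Python sketch. It is not the method used in the reviewed studies. It assumes a common inverse-frequency class weighting heuristic and uncertainty sampling as the active learning query strategy; the function names, labels, and probabilities are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Assumed heuristic: weight = n_samples / (n_classes * class_count),
    so the rare class (e.g. epimutation sites) counts more during training.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def most_uncertain(probabilities, batch_size):
    """Uncertainty sampling: pick the unlabeled instances whose predicted
    positive-class probability is closest to 0.5, i.e. the ones the
    current model is least sure about."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]


# Hypothetical, heavily skewed training labels: 2 DMR vs 8 non-DMR regions.
labels = ["DMR"] * 2 + ["non-DMR"] * 8
weights = balanced_class_weights(labels)

# Hypothetical model outputs for five unlabeled regions.
probs = [0.05, 0.48, 0.91, 0.55, 0.30]
query = most_uncertain(probs, batch_size=2)

print(weights)  # the minority class receives the larger weight
print(query)    # indices of the regions to send to the expert for labeling
```

In an active learning loop, the regions returned by `most_uncertain` would be labeled by an expert, added to the training set, and the model retrained with the class weights applied.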
Machine Learning Applications in Epigenetics.
| Application | Observations |
|---|---|
| Epigenome mapping | Epigenetic site prediction |
| Bioinformatics of complex data | Mixed cell type analysis |
| Biological investigations | Prediction of biological parameters (age, metabolism, neuroscience, evolution) |
| Disease detection | Disease diagnostics and prognosis |
| Exposure detection | Environmental exposure detection and impacts |
| Technology development | Improvement and advances in epigenetic analysis |
Figure 2. Genome-wide prediction of potential epimutation sites based on promoter-only DMR training sets. Chromosomal plot of the germ cell (sperm) data set shows the predicted 3+ sites and the clusters of DMR regions. Red lines below each chromosome line indicate predicted potential DMR sites (3,233) when sperm is used as the training set; blue boxes above each line indicate clusters. The Y-axis shows each of the 21 chromosomes, while the X-axis shows the length of the chromosome with predicted potential DMR locations and the clusters. Clusters are regions that indicate over-representation of sites within a small sub-section of the genome. Modified from.
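The notion of a cluster as an over-representation of sites within a small sub-section of the genome can be sketched with a simple sliding-window rule. This is only an illustration under stated assumptions: the source does not give the authors' clustering criterion, and `find_clusters`, the window size, the minimum site count, and the positions below are all hypothetical.

```python
def find_clusters(positions, window, min_sites):
    """Return merged (start, end) intervals on one chromosome in which
    at least `min_sites` predicted sites fall within `window` bases."""
    pos = sorted(positions)
    clusters = []
    j = 0
    for i in range(len(pos)):
        j = max(j, i)
        # Extend j to the last site within `window` bases of site i.
        while j + 1 < len(pos) and pos[j + 1] - pos[i] <= window:
            j += 1
        if j - i + 1 >= min_sites:
            start, end = pos[i], pos[j]
            if clusters and start <= clusters[-1][1]:
                # Overlaps the previous cluster: merge the two intervals.
                clusters[-1] = (clusters[-1][0], max(clusters[-1][1], end))
            else:
                clusters.append((start, end))
    return clusters


# Hypothetical predicted site positions (base coordinates) on one chromosome.
sites = [100, 150, 220, 5000, 9000, 9050, 9100, 9400]
clusters = find_clusters(sites, window=500, min_sites=3)
print(clusters)
```

A statistical test against the expected genome-wide site density would be a more rigorous way to call over-representation; the fixed threshold here is chosen only for brevity.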
Figure 3. CpG density plot showing number of predicted DMR sites correlated with CpG density. (a) CpG density from the potential predicted germ cell DMR sites (3,234) when sperm is used as the training set to predict genome-wide. (b) CpG density from potential predicted somatic cell DMR sites (1,502) when somatic cell is used as training set to predict genome-wide CpGs. X-axis shows the number of CpGs per 100 bases on average, while Y-axis shows the number of sites. Modified from.
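The quantity on the X-axis, CpGs per 100 bases, and the per-density site counts on the Y-axis can be computed as in the following minimal Python sketch. The function names and the example sequences are hypothetical, and binning by whole-number density is an assumption made for illustration.

```python
from collections import Counter


def cpg_per_100bp(sequence):
    """Number of CpG dinucleotides per 100 bases of a DNA sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    # "CG" cannot overlap itself, so str.count finds every CpG.
    return 100.0 * seq.count("CG") / len(seq)


def density_histogram(sequences):
    """Count how many sites fall into each whole-number CpG density bin."""
    return Counter(int(cpg_per_100bp(s)) for s in sequences)


# Hypothetical 10-base site sequences, kept short for readability.
site_sequences = ["ACGTCGCGAA", "AAAAAAAAAA", "CGAAAAAAAA"]
histogram = density_histogram(site_sequences)
print(histogram)  # maps CpG density bin -> number of sites
```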
Applications of machine learning and molecular epigenetics to medicine.
| Medical records and epidemiology studies |
| Molecular diagnostics for disease and disease susceptibility |
| Facilitating pharmacogenomics studies in therapy development and disease |
| Molecular diagnostics to facilitate treatment options for specific disease and medical conditions |