Machine learning for epigenetics and future medical applications.

Lawrence B Holder¹, M Muksitul Haque¹,², Michael K Skinner².

Abstract

Understanding epigenetic processes holds immense promise for medical applications. Advances in Machine Learning (ML) are critical to realize this promise. Previous studies used epigenetic data sets associated with the germline transmission of epigenetic transgenerational inheritance of disease and novel ML approaches to predict genome-wide locations of critical epimutations. A combination of Active Learning (ACL) and Imbalanced Class Learning (ICL) was used to address past problems with ML to develop a more efficient feature selection process and address the imbalance problem in all genomic data sets. The power of this novel ML approach and our ability to predict epigenetic phenomena and associated disease is suggested. The current approach requires extensive computation of features over the genome. A promising new approach is to introduce Deep Learning (DL) for the generation and simultaneous computation of novel genomic features tuned to the classification task. This approach can be used with any genomic or biological data set applied to medicine. The application of molecular epigenetic data in advanced machine learning analysis to medicine is the focus of this review.

Keywords: Active learning; DNA methylation; deep learning; epigenetics; epigenome; imbalanced-class learning; machine learning; molecular diagnostics

Year: 2017 PMID: 28524769 PMCID: PMC5687335 DOI: 10.1080/15592294.2017.1329068

Source DB: PubMed Journal: Epigenetics ISSN: 1559-2294 Impact factor: 4.528

Abbreviations: ML, Machine Learning; ACL, Active Learning; ICL, Imbalanced Class Learning; DL, Deep Learning; DMR, Differentially methylated DNA region; AGQ+, Active Learner with Generalized Queries; TAN, Tree-Augmented Naïve-Bayes; MIP, Most Informative Positive; CBAL, Class Based Active Learning; SSO, Subset Sample Optimization; SNP, Single Nucleotide Polymorphism; AIS, Artificial Immune Systems; SVM, Support Vector Machines (standard ML approach)

Introduction

Epigenetics is defined as “molecular factors around DNA that regulate genome activity independent of DNA sequence, and are mitotically stable”. In 1942, Conrad Waddington coined the term ‘epigenetics’ using studies of how environment influences development in conjunction with genotype, which leads to the development of the phenotype. Each cell type has a unique epigenome that allows a specific differentiation for the cell. Since a single genotype can be associated with many phenotypes, it is believed that for a single genome sequence infinite epigenomes may exist. One of the main epigenetic mechanisms is DNA methylation, which can influence gene expression without changing the DNA sequence. Additional epigenetic mechanisms include histone modifications, noncoding RNA (ncRNA), and chromatin structure. DNA methylation is one of the primary studied epigenetic mechanisms that has been shown to mediate generational inheritance through the male germ line. A number of studies show that epigenetic changes are essential for developmental processes (e.g., tissue formation, organ formation, sex determination). Epigenetic changes also lead to altered patterns of gene expression that can lead to adverse clinical outcomes, such as obesity, allergies, cancer, schizophrenia, or Alzheimer disease, to name a few. Recent epigenetic studies focus on how an environmental compound or exposure can promote an epigenetic disease state that can be transmitted through generations. Predicting regions of susceptibility to epigenetic changes that are associated with disease is crucial to understand epigenetics, biology, and disease. A major goal of research in this area is to identify regions in the genome that are susceptible to epigenetic modification. This can include DNA methylation changes (e.g., CpG), histone modifications, ncRNA expression, or chromatin structural changes (e.g., nucleosome positioning).
We have started to understand some of the underlying theory of epigenetics and the computational approaches necessary to identify regions that are associated with these changes. However, the extraction of biological data is time consuming and expensive due to the challenges of implementing experimental procedures that can produce epigenetic phenomena and the computational challenges of extracting and analyzing these data. Biological data sets have high dimensionality, but the cases of interest (e.g., disease states) are relatively rare. In epigenetic data sets, for example, DNA methylation data contain only a few differentially methylated DNA regions (DMRs) and many non-DMR sites, while both are described with numerous DNA sequence and genomic features. To address these challenges, an integrated approach that combines feature generation, feature selection, and machine learning on epigenetic data sets is needed. We envision 3 alternative approaches to this integration that involve combinations of Active Learning (ACL) to address the expense of generating epigenetic data, Imbalanced Class Learning (ICL) to address the relatively low occurrence of epimutations in the data, and Deep Learning (DL) to address the difficulty in manually defining relevant genomic features. Fig. 1 depicts these 3 alternative approaches: (i) ACL and ICL are used to learn efficiently from manually generated features; (ii) DL is used to automatically generate features for ACL/ICL; and (iii) a solely DL-based approach incorporates these ML components into one paradigm that holds promise of applying recent dramatic successes in deep learning of sequential data to epigenetics. A list of the various types of Machine Learning (ML) approaches, with their advantages and disadvantages, is given in Table 1.

Figure 1.

Machine Learning approaches to epigenetic data analysis: #1 ACL-ICL on manually generated features; #2 ACL-ICL on DL-generated features; #3 solely DL-based classification. Modified from.

Table 1.

Machine learning approaches for biological data sets, along with their function, advantages, disadvantages, and recent examples.

Machine Learning Approach	Function	Advantages	Disadvantages
Supervised Learning (e.g., support vector machine,⁸⁴ random forest⁸⁵)	Learn a model discriminating one class of biological phenomena from one or more other classes.	Precise model with predictive and interpretative properties.	Requires equally large number of examples from each class.
Unsupervised Learning (e.g., K-means,⁸⁶ hierarchical clustering⁸⁷)	Learn a model descriptive of the biological phenomena in the data.	Does not require class labels on data.	Sensitive to similarity measure; results difficult to interpret.
Semi-supervised Learning (e.g., transduction⁸⁸)	Learn model from mixture of labeled and unlabeled data.	Utilizes all available data; typically outperforms use of labeled data alone.	Sensitive to errors in propagating class labels from labeled to unlabeled data.
Feature Selection (e.g., PCA,⁸⁹ LDA,⁹⁰ wrapper⁹¹)	Reduce large number of features to fewer, more informative features.	Improves efficiency and accuracy of learning.	Sensitive to feature evaluation metric; may discard informative features.
Active Learning (e.g., uncertainty sampling,⁹² most informative instance⁹³)	Identify most informative instances to label for accurate model learning.	Reduces number of examples needed to learn model; reduces burden on human expert and experiment cost.	May focus learner on outliers rather than prominent classes.
Imbalanced class Learning (e.g., minority over-sampling,⁹⁴ boosting⁹⁵)	Learn in the presence of large skew in the number of examples of each class.	Learn with relatively few examples of biological phenomenon of interest.	May underfit or overfit data depending on bias toward minority class.
Deep Learning (DeepBind,¹⁴ DeepMotif¹⁵)	Learns complex representations of concepts in the data.	General purpose and high accuracy.	Sensitive to parameter choices; long training times.

The following sections first describe the main ML techniques recommended for prediction over epigenomic data: active learning, imbalanced class learning, and deep learning. Next, the application of these techniques to biological data sets in general, and epigenetic data sets in particular, is discussed along with the commensurate challenges. Recent results using active learning and imbalanced class learning for epigenetic feature selection and prediction are presented. Lastly, future applications of these ML techniques to molecular diagnostics and medicine are discussed.

Machine learning

Active learning

Biological and molecular data generally come in raw form and need to be annotated with class labels. This requires a domain expert and making the best use of the expert's knowledge and time. A new ML approach has arisen called Active Learning (ACL). ACL is designed to maximize the potential of the Oracle (the human expert) in labeling data by selecting only relevant instances and features. Instead of labeling all the instances, ACL methods can intelligently choose a small number of instances in a few iterations that quickly train the learner while minimizing costs. ACL can produce better classifiers in less time and fewer iterations. In traditional ACL methods, it may not always be easy for the Oracle to label a query with many features, especially those with high precision values. Many of the queries may contain irrelevant features that have no effect on the final outcome (the label). A better approach is to remove some of the irrelevant features for a certain query such that it results in a shorter and more readable query, which is easier for the Oracle to label. Using such generalized queries will help achieve higher accuracy with fewer queries than traditional ACL methods. However, an overly general query may lead to an uncertain label, which may add noise to the learning process. Therefore, the goal of an ACL system should be to produce generalized queries with highly certain answers, from which it can learn a classifier quickly with fewer examples. Even when labeled data abounds, selecting a subset of the features on which to train the learner may result in a better classifier. This capability of ACL improves our ability to select feature subsets from both manually and automatically generated feature sets. The most common and widely used form of ACL is uncertainty sampling, which chooses the most uncertain example as the next one for the Oracle to label. One problem with uncertainty sampling is that it may choose outliers, which are highly uncertain data points.
Therefore, it does not always follow the underlying distribution of data points. The Active Learner with Generalized Queries (AGQ+) is an important ACL method that automatically generates meaningful new features, unlike previous approaches in which new features are manually adjusted. AGQ+ also constructs generalized queries with numeric attribute ranges that are automatically produced from raw numeric attribute data. In recent work, we introduced an ACL method called GQAL, which is similar to AGQ+ but performs a local feature selection per query and achieves superior performance on classifying epigenetic data sets. GQAL uses pool-based uncertainty sampling for constructing a generalized query with don't-care features (irrelevant features in the most uncertain examples). GQAL uses the Tree-Augmented Naïve-Bayes (TAN) learner. TAN has been found to be superior to other learners in this setting, and it provides a probability of classification that is used by GQAL to find the instance whose classification probability is farthest from its distribution in the instance set (i.e., the most uncertain instance). GQAL generalizes the uncertain instance by identifying sets of features whose permuted values have no effect on the prediction for that instance. More detailed results of GQAL on epigenetic data are discussed later.
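
The selection step of pool-based uncertainty sampling can be sketched in a few lines. This is only an illustration under stated assumptions and is not the GQAL or AGQ+ implementation: the function names `uncertainty` and `select_query`, the toy probabilities, and the choice of distance from 0.5 as the two-class uncertainty measure are all hypothetical choices made for the example.

```python
def uncertainty(p_positive):
    """Binary uncertainty: 1.0 when p = 0.5, 0.0 when p is 0 or 1."""
    return 1.0 - 2.0 * abs(p_positive - 0.5)


def select_query(pool_probs):
    """Return the index of the most uncertain unlabeled instance.

    pool_probs: list of P(class = positive) predicted by the current
    classifier for each instance in the unlabeled pool.
    """
    return max(range(len(pool_probs)), key=lambda i: uncertainty(pool_probs[i]))


# Hypothetical classifier outputs for five unlabeled genomic regions.
pool_probs = [0.95, 0.10, 0.52, 0.80, 0.30]
query_index = select_query(pool_probs)
print(query_index)  # 2: the region whose probability is closest to 0.5
```

In a full loop, the selected instance would be labeled by the Oracle, added to the training set, and the classifier retrained before the next query is chosen. Because the rule looks only at the classifier's confidence, an outlier with a probability near 0.5 would be picked just as readily as a representative instance, which is the weakness noted above.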

Imbalanced class learning

In many data sets, there are unequal numbers of instances in each class making an unequal distribution of samples. In this case, the classifier learns most of the target concepts of the majority class, but learns target concepts from the minority class poorly or not at all. Often, the interest is more on the minority class, such that getting rare instances from the minority class can be time consuming and costly. Such an unequal distribution between classes of a data set is known as the class imbalance problem. In recent work, we introduced the TAN+AdaBoost ICL method that uses all the majority and minority class samples and uses boosting to ensure that each class is learned with equal priority. TAN+AdaBoost uses adaptive boosting (AdaBoost), which learns a set, or ensemble, of classifiers using a base classifier (in this case TAN) repeatedly applied to the data set, but with incorrectly-classified examples receiving more weight to bias later classifiers toward correct classifications. While initial classifiers focus on the majority class, later classifiers focus on the minority class. The final classifier consists of the weighted majority vote of the individual classifiers. When applied to the 2 epigenetic data sets (sperm and somatic, described in Fig. 1), TAN+AdaBoost achieved the best overall performance compared with other imbalanced class learners using the combined average for AUC, F-measure, and G-mean, which are popular performance measures for ICL problems.
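
The reweighting idea behind adaptive boosting can be illustrated with a single boosting round. The sketch below is not the TAN+AdaBoost method itself: the base classifier is abstracted away as a list of predictions, labels are assumed to be coded as +1 (minority, e.g., DMR) and -1 (majority), and the function names `adaboost_round` and `weighted_vote` are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}.

    Returns (alpha, new_weights): alpha is the vote weight of this round's
    classifier; new_weights up-weights the examples it misclassified.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]


def weighted_vote(alphas, predictions):
    """Final ensemble label: sign of the alpha-weighted sum of votes."""
    score = sum(a * p for a, p in zip(alphas, predictions))
    return 1 if score >= 0 else -1


# Toy imbalanced set: one DMR (+1) among five non-DMR sites (-1).
y_true = [1, -1, -1, -1, -1, -1]
# A first classifier that predicts the majority class everywhere.
y_pred = [-1, -1, -1, -1, -1, -1]
weights = [1 / 6] * 6

alpha, weights = adaboost_round(weights, y_true, y_pred)
print(round(weights[0], 3))  # 0.5: the missed minority example now holds half the weight
```

After one round, the single misclassified minority example carries as much weight as the five majority examples combined, so the next base classifier is pushed toward getting it right. This is the mechanism by which later classifiers in the ensemble shift their focus to the minority class.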

Deep learning

Deep Learning (DL) has recently demonstrated superior performance in several domains, most notably in image, speech, and natural language processing. Much of DL power resides in its ability to generate complex features while either learning to encode the input data in an unsupervised setting or learning to classify the input data in a supervised setting. The complex feature generation is accomplished using a multi-layer (deep) neural network with specialized nodes, e.g., convolutional input nodes to include neighboring information from around an input data point (e.g., motifs), and logistic or rectified linear units at the intermediate (hidden) and output layers, which reproduce (decode) or classify the input data. The weights on the interconnections between layers are trained using the standard backpropagation method. With typically 10 or more intermediate layers, where each layer's nodes compute a complex feature based on features from previous layers, the network generates complex features for representing the input data. It is this feature generation capability that is critical to an advanced DL-based approach to classification of genomic regions. As depicted by method #2 in Fig. 1, DL can be used to generate complex features from windows over the training DNA sequences. The sequence is then annotated with these features, and this newly annotated training data can be input to an ACL-ICL method to select the best features and data for learning. In addition to DL being used to generate new features, DL can also be used to minimize the number of relevant features; this would assist active learning. The generalized query based ACL technique becomes extremely computationally expensive when the number of features increases. As depicted by method #3 in Fig. 1, the DL network itself can be trained to identify genomic regions of interest, and then the learned network can be applied to the whole genome.

Machine learning in biological datasets

Machine Learning applied to biological data sets has a long history of success dating back to before 1990. A complete survey is beyond the scope of this article, but Table 1 summarizes the major ML techniques and specific approaches applied to biological data sets that are also applied to epigenetic data in particular, as described in the next section. Table 1 also describes the main advantages and disadvantages of each approach and cites some recent examples from the literature. Supervised algorithms are used when there are labeled examples of 2 or more classes of interest (e.g., disease vs. healthy). Support vector machines and random forests of decision trees are among the most popular methods. Supervised algorithms have been used for the prediction of gene ontology and gene expression profiles across different environmental and experimental conditions. Unsupervised algorithms are used when the samples are not labeled. K-means clustering and hierarchical clustering have been widely used in biological data sets. Chromatin data has been used with unsupervised learning algorithms for annotating the genomes to identify novel groups of functional elements. Semi-supervised algorithms fall between supervised and unsupervised, especially for cases when only a small portion of the samples is labeled. Semi-supervised algorithms have been used to identify functional relationships between genes and transcription factor binding sites. They are widely used for gene-finding approaches where the entire genome is the unlabeled set and only a collection of genes is annotated. Tentative labels are given after a first pass and the algorithm iterates to improve the learning model. Feature selection methods, such as principal component analysis, linear discriminant analysis, and wrapper methods, seek to reduce the dimensionality of data sets, identify informative features, and remove irrelevant features, to avoid overfitting the learned model.
Both ACL and ICL methods have applications in biological data sets. Retrieving good biological data can take months to years. Often, when experiments are done, researchers seek specific cases having a low incidence rate. So most biological data are naturally imbalanced. For example, among the 27,000 mouse genes, an experiment may observe only about 100 whose DNA methylation was changed within the experimental settings. Therefore, collecting data on such changes is a time consuming, multi-step process, and, naturally, results in a class imbalance problem. Building a classifier based on such few instances requires the learner to choose instances and features that are most informative. By choosing few instances and features, if the learner can learn the target concept quickly, then a good classifier can be found without running more extensive experiments to obtain more rare instances. To address this issue, popular ML techniques, such as oversampling or undersampling, are used. However, these approaches have their own drawbacks. Oversampling the minority class leads to overfitting, whereas undersampling the majority class leads to underfitting. Instead, ML techniques like ACL certainly can help here. Therefore, both ACL and ICL methods have applications in biological data sets. Both types of methods have been widely used in other domains but, in biological data sets, only a few studies show the use of ACL, and even fewer studies show the use of the ICL methods in practice. One ACL study used the Most Informative Positive (MIP) ACL method to find p53 mutants (mutated p53 is responsible for half of human cancers). In their ACL method, they train their classifier by only using positive instances that pass a certain score (which ranks all unlabeled instances) and include negative instances in the training set only if there are too few positive instances. 
Their approach looked for functionally active examples and, in their first in vivo experiment, the authors show that their MIP approach significantly increased discovery of novel positive mutants. A different study uses ACL techniques to annotate digital histopathology data. Their method, Class Based Active Learning (CBAL), uses a mathematical model that calculates the cost of building a training set with a certain size and class ratio. Among the few studies addressing the class imbalance problem in biological data sets, subset sample optimization (SSO) uses an ensemble-based approach and different sets of classifiers in its optimal training set selection procedure and another set of classifiers for classification on the test set. They have used several medical data sets from the UCI ML Repository and used a genome-wide association study (GWAS, http://gwas.nih.gov/) data set that is based on single nucleotide polymorphism (SNP) of age-related macular degeneration. The Artificial Immune Systems (AIS)-based classification algorithm has performed well on highly skewed data sets as compared with other methods that use Support Vector Machine (SVM). Applications of DL to biological data sets have increased substantially in recent years. Much of this work has focused on biomedical imaging, but a significant number of studies have focused on genomic data. These ML tasks include protein structure prediction, protein classification, and gene expression regulation. Such applications are characterized by the computation of hundreds to thousands of predetermined features, such as motifs, which are input to a DL network. But this approach is essentially just replacing the ACL-ICL method #1 in Fig. 1 with a deep neural network. Interestingly, methods #2 and #3 avoid the predetermination of specific features, but instead allow the DL network to generate relevant features from lower-level sequence data.
Some recent approaches have used DL networks to generate relevant motifs using convolution layers on windowed sequence data, such as in the DeepBind method. Other approaches have used a one-hot encoding input to a convolution layer, where each sequence window of size W is represented by a Wx4 array indicating which bases (A, G, C, T) are present in the sequence window, such as in the DeepMotif method. These methods have achieved classification performance competitive with top non-DL methods.
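
The one-hot input and convolution step described above can be made concrete with a small sketch. It does not reproduce DeepBind or DeepMotif: the function names `one_hot` and `conv_scan` are hypothetical, and the filter weights are set by hand here purely for illustration, whereas in a DL network they would be learned by backpropagation.

```python
BASES = "AGCT"  # column order follows the text: (A, G, C, T)


def one_hot(window):
    """Encode a DNA window of length W as a W x 4 list of 0/1 rows."""
    return [[1 if base == b else 0 for b in BASES] for base in window.upper()]


def conv_scan(encoded, kernel):
    """Slide a K x 4 filter along the W x 4 input; return W - K + 1 scores."""
    k = len(kernel)
    scores = []
    for start in range(len(encoded) - k + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(k)
            for j in range(4)
        )
        scores.append(score)
    return scores


window = "ACGTCG"
encoded = one_hot(window)  # 6 x 4 array

# Hypothetical hand-set filter that responds to the dinucleotide "CG".
cg_filter = [
    [0, 0, 1, 0],  # first position: C
    [0, 1, 0, 0],  # second position: G
]
scores = conv_scan(encoded, cg_filter)
print(scores)  # [0, 2, 0, 0, 2]: peaks where "CG" occurs in the window
```

A learned filter of this kind behaves like a motif detector, which is why convolution layers over one-hot sequence windows can generate motif-like features without those motifs being defined in advance.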

Epigenetics

Machine learning and epigenetics

The ability to identify regions of the genome susceptible to epimutations will greatly improve our ability to diagnose disease, and the recommended ML techniques have demonstrated this ability. Currently, diagnosis of disease is done through signs and symptoms followed by genomic testing and screening. This genomic testing can identify molecular biomarkers and can identify the risk of disease for the patients. However, personalized medicine is more about studying the genomic profile to predict and prevent the diseases a patient is predisposed to and to recommend better care for such patients through pharmaceuticals, lifestyle changes, and screening. Advanced experimental and computational techniques have brought us closer to realizing this goal of personalized medicine. Recent advances in epigenomic technology have allowed research involving high-throughput data and ML-based bioinformatics to make significant contributions. To identify epigenetic changes and predict disease, several approaches are useful. These approaches combine collection of genomic features, such as epigenetic marks and genetic alterations (SNPs, copy number variations, repeat elements, transcriptomes, and motifs). Given its increased ability to collect data and identify epigenetic-relevant features, ML continues to improve its accuracy at investigating the epigenome and identifying epimutation sites, as well as expanding the medical applications of epigenetic-based disease diagnosis. There have been several studies using ML in epigenetics research (Table 2). Applications have included epigenome mapping, bioinformatics on complex data, biological investigations, disease detection, environmental exposure detection, and technology development. One of the initial studies looked into finding imprinted genes in human and mouse genomes. Imprinted genes are epigenetically modified genes that are also associated with various diseases.
The genome-wide prediction of imprinted murine genes focused on comprehensive profiling of the mouse genes. The research group found thousands of relevant features for better prediction of the imprinted gene by mining the DNA sequence characteristics around 100 kb upstream and downstream of the imprinted genes. They used the Equbits Foresight (www.equbits.com) classifier and predicted 722 new sites. Their study looked into 23,788 annotated autosomal mouse genes and identified 600 mouse imprinted genes. The same group later mined the human genome for new imprinted sites. They again used the Equbits Foresight with SVM and 622 features and used their own sparse multinomial logistic regression (SMLR) classifier with 820 features to predict novel human imprinted genes. Another study looks into the correlation of different features to DNA methylation of CpG islands. They mined features from 190 CpG islands from human chromosome 21 and tested it on the rest of the CpG islands in the genome to find methylated CpG islands. They looked for correlation among features and found that different methylation profiles exist not only for different tissue types but also for different diseases. Wang et al. compared a standard ML approach (SVM) to a DL autoencoder approach called DeepMethyl using several tumor cell lines to assess CpG methylation and associated genomic topological features. Results show that the DL approach can improve over SVM in some cases. Although using lower resolution (50 kb windows), these observations show the value of using ML and DL to provide insight into epigenetics.

Table 2.

Machine Learning Applications in Epigenetics.

Application	Observations	Literature
Epigenome mapping	Epigenetic site prediction	27-32
Bioinformatics of complex data	Mixed cell type analysis	33-35, 38
Biological investigations	Predictions of biological parameters (age, metabolism, neuroscience, evolution)	25, 36, 37, 39, 40, 52
Disease detection	Disease diagnostics and prognosis	9, 11, 41-45
Exposure detection	Environmental exposure detection and impacts	4, 26
Technology development	Improvement and advances in epigenetic analysis	46-51

A previous study by the authors used a combined ACL-ICL method (Fig. 1, method #1) with previous epigenetic data sets of sperm promoter differential DMRs, termed epimutations from promoters. This involved a sequential approach of ACL followed by ICL on a gene promoter specific DMR set. The prediction for the genome-wide locations for potential DMRs identified 3,353 sites and the chromosomal locations (Fig. 2). One of the main advantages of using ACL- and ICL-based methods is that these approaches are classifier-independent; therefore, another classifier can be used for prediction purposes. Future studies will explore more advanced ML approaches and more complete genome-wide epigenetic data.

Figure 2.

Genome-wide prediction of potential epimutation sites based on promoter only DMR training sets. Chromosomal plot of germ cell data set sperm shows the predicted 3+ sites and the clusters of DMR regions. Red lines below each chromosome line indicate predicted potential DMR sites (3,233) when sperm is used as the training set; blue boxes above each line indicate clusters. Y-axis shows each of the 21 chromosomes while X-axis shows the length of the chromosome with predicted potential DMR locations and the clusters. Clusters are regions that indicate over-representations of sites within a small sub-section of the genome. Modified from.

Prediction of epigenetic states from relevant genomic features

Just as with any machine learning approach used for classification, the ML approach in epigenetics proceeds by training a classifier with relevant features, generating models, and then performing prediction on a set-aside test set. For the first phase of training, classifier-appropriate genomic features are needed, which are correlated with the label of the epigenetic data. Once the samples are properly labeled and features computed, the ML technique would build a predictive model. Genomic features can include both DNA sequence and epigenetic components. Genetic features, such as repeat elements, CpG density, response elements, or specific sequences, are all DNA sequence-based features that impact the epigenome. In contrast, epigenetic features, such as DNA methylation or histone-mediated nucleosome positioning, and transitions between euchromatin and heterochromatin can impact gene expression and genetic features. More recently, epigenetic alterations have also been shown to influence genome stability and promote genetic sequence mutations. Therefore, the high degree of integration between genetics and epigenetics suggests that both features need to be considered in machine learning. One of the main challenges of successful model building is how to use the high amount of available sample domain knowledge to guide the ML process. Having a good understanding and proper selection of genomic features is important for these kinds of tasks. Feature engineering or combining different features also needs to be considered. Appropriate pre-processing, data-cleaning, and careful selection of labeled data are important for building models with high accuracy. In the case of collecting genomic features, selection of a proper window size from which the features are collected is also important and benefits from the consideration of prior knowledge.
Since epigenetic data are expensive to acquire, alternative methods, such as prediction of potential epigenetic sites from DNA sequence, can act as a guide for future experimental epigenetic research or as a substitute for the data. The same is true for any genomic research. Mining of epigenetic profiles starts with extraction of interesting properties from DNA sequence data near base regions (location of epigenetic changes in the genome). After retrieving the training set, these locations are often annotated to find the name and orientation of the gene. FASTA files are created from up to 100 kb upstream and downstream of the target genes. After construction of FASTA files for extraction of genomic features, tools such as RepeatMasker are used to find SINE, LINE, ERVL, ERV, and other repeat elements to the upstream and downstream of the base locations. One of the common ways of extracting genomic features from sequences is through identification of repeat elements. Identifying repeat elements and consensus sites helps us detect interesting patterns from these sites. Other genomic features are GC content and CpG sites. Tools such as CpGislandSearcher can be used to find CpG islands in these regions. CpG islands work as catalysts as they overlap with promoter, enhancer, and other regulatory regions. Since over-representation of CpG islands can be due evolutionarily to reduced amount of DNA methylation, which then leads to less CpG to TpG mutation, lack of CpG islands can be a predictor of DNA methylation. In the previous study (shown in Fig. 2), one of the primary features was a low CpG density at epigenetic sites (Fig. 3). These are termed CpG deserts and will be a critical feature to consider.
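
Two of the sequence features mentioned above, CpG density and GC content, are simple to compute directly from a sequence window. The following is a minimal sketch, not the feature pipeline used in the cited studies: the function names, the window and step sizes, and the toy sequence are assumptions made for illustration. CpG density is expressed as CpGs per 100 bases.

```python
def cpg_density(seq):
    """Number of CpG dinucleotides per 100 bases of sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    cpg_count = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return 100.0 * cpg_count / len(seq)


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def windowed_cpg_density(seq, window, step):
    """CpG density for each window of the given size, moving by step."""
    return [
        cpg_density(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: a CpG-rich stretch followed by a CpG-poor stretch.
toy = "CGCGCGCGCG" + "ATATTTAAAT"
print(windowed_cpg_density(toy, window=10, step=10))  # [50.0, 0.0]
```

Applied along a chromosome, a profile like this would flag low-density windows as candidate CpG deserts; the density threshold that defines a desert is a modeling choice that this sketch does not fix.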

Figure 3.

CpG density plot showing number of predicted DMR sites correlated with CpG density. (a) CpG density from the potential predicted germ cell DMR sites (3,234) when sperm is used as the training set to predict genome-wide. (b) CpG density from potential predicted somatic cell DMR sites (1,502) when somatic cell is used as training set to predict genome-wide CpGs. X-axis shows the number of CpGs per 100 bases on average, while Y-axis shows the number of sites. Modified from.

Another important class of genomic feature is the DNA sequence motif. Common patterns among biologically relevant sites can be identified using motif finding tools. Motifs are short sequences that have biologically significant predicted roles. Motifs are identified with a probability matrix for each base position such that a certain combination of those sequences matches with every sub-sequence. Some motifs are also found to be unique to DNA methylation sites. Discovery tools like Oligo, LocalMotif, Prospector, and glam2, among other pattern discovery algorithms, have been used to find novel motifs, which are the best predictors of new DMR sites. These motifs are usually constructed by running these epigenetic sites from related experiments through some of the popular motif finding tools. For the murine imprinted gene project, the authors initially looked at 4 million genomic features, searching within a certain genomic distance. Most of these features were constructed by combining all combinations of features, ranking them based on which are more relevant, and then picking only the most relevant ones for final analysis. The above-mentioned DNA sequence characteristics (e.g., motifs, CpG islands), and many other features and techniques need to be used in the prediction of novel epigenetic sites. The amount of genomic features can be enormous, and finding relevant genomic features that help identify epigenetic sites is still a big challenge.
Future research will need to develop more efficient and novel ML tools that combine computational approaches, including DL (for feature generation), ACL (to select the optimal feature set), and ICL (to improve accuracy when classes are imbalanced). Novel DL approaches tailored to predicting epigenetic alterations (or epimutations) are also promising, such as multi-dimensional convolution layers that are able to capture complex properties of the genome (e.g., CpG density) from the original sequence without formally defining such features. Although the current development and validation will involve epigenetics and genome-wide prediction of epimutations, these novel ML tools can be applied to other biological and non-biological data sets. A previous study applied promoter DMR training sets in a preliminary ML approach on the rat genome and predicted 40,000 potential DMR sites genome-wide. Future research will need to advance the ML tool by using unbiased genome-wide data and advanced feature generation and selection. The rat ML tool was also applied to the human genome in a preliminary study and identified 20,000 potential human DMR sites susceptible to environmental reprogramming. Therefore, the previous studies support the need to develop more advanced ML tools for genomic and biological data.
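The idea that a convolution layer can recover a property such as CpG density from raw sequence can be illustrated with a toy example. The sketch below one-hot encodes a sequence and applies a single width-2 filter followed by a ReLU. It is a simplification under stated assumptions: it uses one one-dimensional filter rather than the multi-dimensional layers mentioned above, and the filter weights are set by hand, whereas in a DL model they would be learned from labeled data. All names are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-channel vector; unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d_relu(encoded, kernel, bias=0.0):
    """Valid 1D convolution over sequence positions, followed by ReLU."""
    width = len(kernel)
    outputs = []
    for i in range(len(encoded) - width + 1):
        total = bias
        for k in range(width):
            for channel in range(len(BASES)):
                total += encoded[i + k][channel] * kernel[k][channel]
        outputs.append(max(0.0, total))
    return outputs


# Hand-set filter (an assumption for illustration): the first position
# responds to C and the second to G. With a bias of -1, only an exact
# "CG" (sum of 2) survives the ReLU, giving an activation of 1.
CPG_KERNEL = [
    [0.0, 1.0, 0.0, 0.0],  # position 0: C
    [0.0, 0.0, 1.0, 0.0],  # position 1: G
]


def cpg_density_from_conv(seq):
    """CpGs per 100 bases, obtained by summing the filter activations."""
    if not seq:
        return 0.0
    activations = conv1d_relu(one_hot(seq), CPG_KERNEL, bias=-1.0)
    return 100.0 * sum(activations) / len(seq)
```

Summing the activations plays the role of a pooling step, so the network output reproduces a hand-defined CpG density feature. A trained model would be free to learn other filters that no one thought to define in advance.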

Medical applications

Applications of machine learning and epigenetics to medicine

Machine learning was first applied to medicine with the use of electronic health records. An example is a comparison of approaches for heart failure cases. Recently, ML has been applied to pharmacology for improved therapy and pharmaceutical treatment design. Applications of ML in cardiovascular risk prediction, radiation oncology, and metabolic disease have been reported. ML has also been applied to clinical vision science and psychiatry. The application of ML to large molecular and clinical data sets will be critical in the future and will have significant applications in medicine. The applications of machine learning and molecular epigenetics to medicine are outlined in Table 3.

Table 3.

Applications of machine learning and molecular epigenetics to medicine.

Medical records and epidemiology studies

Molecular diagnostics for disease and disease susceptibility

Facilitating pharmacogenomics studies in therapy development and disease

Molecular diagnostics to facilitate treatment options for specific disease and medical conditions

One of the first applications of epigenomics in medicine will be the development of molecular diagnostics for specific diseases or physiologic abnormalities. A number of disease conditions have been shown to be associated with epigenome modifications. Specific epimutations have been identified and correlated with specific physiologic abnormalities, such as in cancer, neurodegenerative disorders, fertility, obesity, ovarian disease, and gonadal function. Epigenetic programming and heterogeneity may play a role in standard therapies not being useful for many of these diseases. A combination of genetic and epigenetic approaches and diagnostic development is required for proper personalized medicine treatments. With the availability of massive amounts and novel types of data, there is a growing need to apply ML-based computational approaches to mine these data and extract meaningful insights. Using a trial-and-error approach to compare different classifiers and ML approaches is not very useful. To improve performance, there is a significant need for additional theoretical, experimental, and practical knowledge about ML techniques and the specific research domain. To realize the goal of personalized medicine, epigenetic modifications need to be identified. The prediction of genomic sites that are susceptible to epigenetic alterations will dramatically increase the potential to develop efficient molecular diagnostics for specific medical conditions. The application of ML to identify susceptible epimutation sites in the genome has been reported. Therefore, ML will not simply be used in medical records or population-based epidemiology, but in the actual identification of molecular information to assist in the diagnostics and treatment of disease.
We propose that the combination of Active Learning, Imbalanced Class Learning, and Deep Learning represents a promising direction toward realizing this goal.
