Anand Pratap Singh, Sarthak Mishra, Suraiya Jabin.
Abstract
Regulatory elements play a critical role in the development of eukaryotic organisms by controlling the spatio-temporal pattern of gene expression. Enhancers are one class of these elements; they contribute to the regulation of gene expression through chromatin looping or eRNA expression. Experimental identification of a novel enhancer is costly, so there is interest in computational approaches that predict enhancer regions in a genome. Existing computational approaches have primarily been trained on high-throughput data such as transcription factor binding sites (TFBS), DNA methylation, and histone modification marks. Purely sequence-based approaches, on the other hand, are promising because they are not biased by the complexity or context specificity of such datasets. In sequence-based approaches, machine learning models are trained either directly on sequences or on sequence features to classify sequences as enhancers or non-enhancers. In this paper, we derived statistical and nonlinear dynamic features, along with k-mer features, from experimentally validated sequences taken from the VISTA Enhancer Browser using a random walk model, and applied different machine learning methods to predict whether an input test sequence is an enhancer. Experimental results demonstrate the success of the proposed model, based on an ensemble method, with an area under the curve (AUC) of 0.86, 0.89, and 0.87 in B cells, T cells, and natural killer cells, respectively, for the histone marks dataset.
Year: 2018 PMID: 30374023 PMCID: PMC6206163 DOI: 10.1038/s41598-018-33413-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Flow chart of the proposed system.
Figure 2. Random walk in enhancers.
Figure 3. Comparison of features of the training dataset using histograms.
Figure 4. Comparison of features of the training dataset using box plots and p-values.
Comparison of features of Positive and Negative samples.
| Feature name | Positive samples (enhancer): SD | Positive samples (enhancer): Mean | Negative samples (non-enhancer): SD | Negative samples (non-enhancer): Mean | Test statistic | p-value |
|---|---|---|---|---|---|---|
| ‘GGCAG’ | 3.63 | 2.88 | 2.37 | 1.99 | 13.95 | 5.77e-44 |
| ‘TGTT’ | 7.31 | 10.76 | 6.82 | 9.45 | 7.55 | 4.55e-14 |
| Dfa | 0.05 | 1.57 | 0.06 | 1.60 | −16.98 | 4.67e-64 |
| Rvntsl | 0.03 | 0.07 | 0.05 | 0.10 | −27.71 | 5.04e-165 |
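The two kinds of features compared above, k-mer counts such as ‘GGCAG’ and statistics derived from a random walk over the sequence, can be illustrated with a minimal sketch. The step rule used here (purines +1, pyrimidines −1) and the choice of the walk's standard deviation as the "SD" feature are assumptions made for illustration; the source does not state the exact mapping or definition, and the function names are hypothetical.

```python
from statistics import pstdev

def count_kmer(seq, kmer):
    """Count overlapping occurrences of `kmer` in `seq` (case-insensitive)."""
    seq, kmer = seq.upper(), kmer.upper()
    k = len(kmer)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer)

# Assumed step rule (not taken from the source): purines step up, pyrimidines step down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}

def dna_walk(seq):
    """Return the cumulative position of a one-dimensional walk along `seq`."""
    position = 0
    walk = []
    for base in seq.upper():
        position += STEP.get(base, 0)  # unknown bases such as N add no step
        walk.append(position)
    return walk

def walk_sd(seq):
    """Population standard deviation of the walk positions (0.0 for an empty sequence)."""
    walk = dna_walk(seq)
    return pstdev(walk) if walk else 0.0

if __name__ == "__main__":
    example = "TGTTGTTAGGCAG"  # toy sequence, not real enhancer data
    print(count_kmer(example, "TGTT"), count_kmer(example, "GGCAG"), walk_sd(example))
```

Each sequence would then be described by a vector of such values, which is what the classifiers take as input.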
Figure 5. Experimental design of the proposed system.
Features selected for Implementation of Computational Methods.
| Bagged Tree based Ensemble method | RUSBoosted Tree based Ensemble method | CNN |
|---|---|---|
| Classifier: Bagged Trees | Ensemble method: RUSBoost | Input dense layer of the CNN has 50 units; the hidden dense layer has 10 neurons with activation function ‘tanh’; the output layer has 2 neurons (1 for each class) with activation function ‘softmax’ |
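The dense layers listed for the CNN above (50 units, then 10 neurons with tanh, then 2 neurons with softmax) can be sketched as a plain forward pass. This is only an illustration of the layer shapes: the number of input features (`N_FEATURES`), the random placeholder weights, and the linear activation on the first layer are assumptions, and any convolutional layers are omitted because the table does not describe them.

```python
import math
import random

def dense(x, weights, biases, activation=None):
    """Apply one fully connected layer: activation(W x + b)."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Random placeholder weights; a trained model would supply real ones."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(features, layers):
    """Run the three dense layers and return class probabilities."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = dense(features, w1, b1)          # 50 units; activation assumed linear
    h2 = dense(h1, w2, b2, math.tanh)     # 10 neurons, tanh
    return softmax(dense(h2, w3, b3))     # 2 neurons, softmax (one per class)

rng = random.Random(0)
N_FEATURES = 8  # hypothetical input size; not given in the source
layers = [init_layer(N_FEATURES, 50, rng),
          init_layer(50, 10, rng),
          init_layer(10, 2, rng)]
probs = forward([0.5] * N_FEATURES, layers)
print(probs)  # two probabilities, one per class
```

The softmax output gives one probability per class, so the predicted label is simply the index of the larger value.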
Performance Comparison of Proposed Model with different classifiers.
| Classification method | AUC (true enhancer data from VISTA Enhancer Browser) | Accuracy (true enhancer data from VISTA Enhancer Browser) |
|---|---|---|
| Bagged Tree based Ensemble method | 0.91 | 93.3% |
| RUSBoosted Tree based Ensemble method | 0.90 | 91.3% |
| Convolutional Neural Network (CNN) | 0.90 | 92.4% |
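For readers unfamiliar with the two metrics reported here, the following minimal sketch shows how AUC and accuracy can be computed from classifier scores. It uses the pairwise-ranking definition of AUC and a 0.5 decision threshold for accuracy; both are generic choices assumed for illustration, not a description of the authors' evaluation code, and the toy labels and scores are invented.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum(1 for y, s in zip(labels, scores)
                  if (s >= threshold) == (y == 1))
    return correct / len(labels)

if __name__ == "__main__":
    labels = [1, 1, 0, 0]          # 1 = enhancer, 0 = non-enhancer (toy data)
    scores = [0.9, 0.4, 0.6, 0.1]  # hypothetical classifier scores
    print(auc(labels, scores), accuracy(labels, scores))
```

AUC depends only on how the scores rank positives against negatives, while accuracy also depends on the chosen threshold, which is why the two columns need not move together.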
Figure 6. Performance of the model for different classifiers on test data from VISTA Enhancer Browser.
Figure 7. Performance of the model (Bagged Tree based Ensemble method) on histone test data.
Performance of Proposed Model for histone test dataset.
| Classification method | B cells: AUC | B cells: Accuracy | T cells: AUC | T cells: Accuracy | Natural killer cells: AUC | Natural killer cells: Accuracy |
|---|---|---|---|---|---|---|
| Bagged Tree based Ensemble method | 0.86 | | 0.89 | | 0.87 | |
| RUSBoosted Tree based Ensemble method | 0.66 | 66.4% | 0.73 | 72.0% | 0.67 | 68.3% |
| Convolutional Neural Network (CNN) | 0.75 | 74.3% | 0.72 | 73.6% | 0.71 | 71.0% |
Performance comparison with other methods in the literature (datasets differ from the one used in the proposed work).
| Authors | Datasets used | Features used | Method used | AUC/Accuracy (Acc) |
|---|---|---|---|---|
| Bu, H., Gan, Y., Wang, Y., Zhou, S., & Guan, J. | Histone modification | DNA sequence compositional | Deep Belief Network | Acc 92.0% |
| Yang, B., Liu, F., Ren, C., Ouyang, Z., Xie, Z., Bo, X., & Shu, W. | Human and mouse noncoding fragments in the VISTA Enhancer Browser | DNA sequence alone | Deep-learning-based hybrid | AUC 0.956 |
| Liu, F., Li, H., Ren, C., Bo, X., & Shu, W. | Histone modifications (ChIPSeq) | 1,114-dimensional | Deep learning | Acc 97.65% |
| Kim, S. G., Harwani, M., Grama, A., & Chaterji, S. | Chromatin features | p300 binding sites, as enhancers, and TSS and random non-DHS sites, as non-enhancers. We perform same-cell and cross-cell predictions to quantify the validation rate and compare against two state-of-the-art methods, DEEP-ENCODE and RFECS | Deep neural network (DNN) | Acc 91.6% |
| Liu, B., Fang, L., Long, R., Lan, X., & Chou, K. C. | Chromatin state information of nine cell lines, including H1ES, K562, GM12878, HepG2, HUVEC, HSMM, NHLF, NHEK and HMEC | Physical structural | SVM classification with RBF kernel function | Acc 76.89% |
| Kleftogiannis, D., Kalnis, P., & Bajic, V. B. | Histone modification marks | Sequence characteristics | Ensemble SVM | Acc 90.2% |
| Rajagopal, N., Xie, W., Li, Y., Wagner, U., Wang, W., Stamatoyannopoulos, J., & Ren, B. | 24 Histone modifications in two distinct human cell types, embryonic stem cells and lung fibroblasts | p300 ENCODE data in H1 | Random forests | Acc 95% |
| Fernandez, M., & Miranda-Saavedra, D. | Histone epigenetic marks | Optimum combination of Epigenetic profiles | Genetic algorithm optimized Support vector machines | Acc 85.1%, AUC 0.966 |
| Proposed method | VISTA Enhancer Browser (experimentally validated hg19) | K-mer frequency, Statistical and Non-linear features (sd, dfa, hurst, sampan, ac, rvntsl, ac_200, ac_300) | Ensemble Method (Bagged Tree) | Acc 93.3%, AUC 0.91 on test data from VISTA Enhancer Browser |
Figure 8. Flow graph of the designed enhancer prediction tool.