Quang H Nguyen, Thanh-Hoang Nguyen-Vo, Nguyen Quoc Khanh Le, Trang T T Do, Susanto Rahardja, Binh P Nguyen.
Abstract
BACKGROUND: Enhancers are non-coding DNA fragments that are crucial in gene regulation (e.g. transcription and translation). Because enhancers vary widely in location and are scattered freely across the 98% of the genome that is non-coding, identifying them is more complicated than identifying other genetic factors. To address this biological issue, several in silico studies have sought to identify and classify enhancer sequences among a myriad of DNA sequences using computational methods. Although recent studies have improved performance, shortfalls in these learning models remain. To overcome the limitations of existing learning models, we introduce iEnhancer-ECNN, an efficient prediction framework that uses one-hot encoding and k-mers for data transformation and ensembles of convolutional neural networks for model construction, to identify enhancers and classify their strength. The benchmark dataset from Liu et al.'s study was used to develop and evaluate the ensemble models. A comparative analysis between iEnhancer-ECNN and existing state-of-the-art methods was done to fairly assess model performance.
Keywords: Classification; Convolutional neural network; Deep learning; Enhancer; Ensemble; Identification; One-hot encoding
Year: 2019 PMID: 31874637 PMCID: PMC6929481 DOI: 10.1186/s12864-019-6336-3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Sequence characteristics of (a) enhancers versus non-enhancers and (b) strong enhancers versus weak enhancers. The logo representations were created with Two Sample Logo using a t-test (p<0.05); A, T, G, and C are colored green, red, yellow, and blue, respectively
Results of an enhancer identification trial (trial 5 in Table 2) on the independent test dataset
| Training : Validation (Ratio 4:1) | ACC | AUC | SN | SP | MCC |
|---|---|---|---|---|---|
| Model 1 (Parts 2, 3, 4, 5 : Part 1) | 0.756 | 0.815 | 0.750 | 0.515 | |
| Model 2 (Parts 1, 3, 4, 5 : Part 2) | 0.753 | 0.829 | 0.775 | 0.730 | 0.506 |
| Model 3 (Parts 1, 2, 4, 5 : Part 3) | 0.740 | 0.825 | 0.670 | 0.485 | |
| Model 4 (Parts 1, 2, 3, 5 : Part 4) | 0.776 | 0.831 | 0.790 | 0.765 | |
| Model 5 (Parts 1, 2, 3, 4 : Part 5) | 0.746 | 0.821 | 0.745 | 0.750 | 0.495 |
| Ensemble Model | | | 0.790 | 0.740 | 0.531 |
The highest value for each metric is in bold
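
The table above lists five models, each trained on four parts and validated on the remaining one, together with an ensemble of the five. The layout can be sketched as follows. This is only a sketch: it assumes the ensemble averages the five models' predicted probabilities and thresholds the average at 0.5, which this excerpt does not state. The names `fold_layout` and `ensemble_predict` are hypothetical, and the probabilities are made-up illustration values.

```python
def fold_layout(n_parts=5):
    """List (training_parts, validation_part) pairs, one pair per model."""
    parts = list(range(1, n_parts + 1))
    return [([p for p in parts if p != held_out], held_out) for held_out in parts]


def ensemble_predict(per_model_probs, threshold=0.5):
    """Average per-sample probabilities across models, then threshold.

    Averaging is an assumption; the combination rule is not given in the text.
    """
    n_models = len(per_model_probs)
    averaged = [sum(column) / n_models for column in zip(*per_model_probs)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Reproduce the "Parts 2, 3, 4, 5 : Part 1" pattern from the table rows.
for train_parts, val_part in fold_layout():
    print(f"train on parts {train_parts}, validate on part {val_part}")

# Made-up probabilities from five models for three test sequences.
probs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.4, 0.4],
    [0.7, 0.1, 0.3],
    [0.9, 0.3, 0.7],
    [0.6, 0.2, 0.4],
]
averaged, labels = ensemble_predict(probs)
print(labels)
```

Averaging probabilities rather than taking a majority vote keeps each model's confidence in play, which is one common reason to prefer it for small ensembles.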
Results of an enhancer classification trial (trial 9 in Table 4) on the independent test dataset
| Training : Validation (Ratio 4:1) | ACC | AUC | SN | SP | MCC |
|---|---|---|---|---|---|
| Model 1 (Parts 2, 3, 4, 5 : Part 1) | 0.780 | 0.620 | 0.405 | ||
| Model 2 (Parts 1, 3, 4, 5 : Part 2) | 0.660 | 0.740 | 0.720 | 0.600 | 0.322 |
| Model 3 (Parts 1, 2, 4, 5 : Part 3) | 0.670 | 0.730 | 0.490 | 0.364 | |
| Model 4 (Parts 1, 2, 3, 5 : Part 4) | 0.665 | 0.715 | 0.660 | 0.330 | |
| Model 5 (Parts 1, 2, 3, 4 : Part 5) | 0.600 | 0.681 | 0.680 | 0.520 | 0.203 |
| Ensemble Model | 0.695 | 0.759 | 0.840 | 0.550 | **0.408** |
The highest value for each metric is in bold
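
The metrics reported in these tables (ACC, SN, SP, MCC) all follow from a binary confusion matrix. The sketch below uses the standard definitions. The counts are an assumption: a test set of 100 strong (positive) and 100 weak (negative) enhancers, chosen so that they reproduce the ensemble row's SN of 0.840 and SP of 0.550. `binary_metrics` is a hypothetical helper name.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    total = tp + fn + tn + fp
    acc = (tp + tn) / total
    sn = tp / (tp + fn)  # sensitivity: recall on the positive class
    sp = tn / (tn + fp)  # specificity: recall on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Assumed counts: 84 of 100 positives and 55 of 100 negatives classified correctly.
metrics = binary_metrics(tp=84, fn=16, tn=55, fp=45)
print({name: round(value, 3) for name, value in metrics.items()})
```

Under these assumed counts, ACC rounds to 0.695 and MCC to 0.408, which agrees with trial 9 in the 10-trial classification table.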
Independent test identifying enhancers and non-enhancers under 10 trials
| Trial | ACC | AUC | SN | SP | MCC |
|---|---|---|---|---|---|
| 1 | 0.768 | 0.831 | 0.780 | 0.755 | 0.535 |
| 2 | 0.765 | 0.834 | 0.790 | 0.740 | 0.531 |
| 3 | 0.770 | 0.835 | 0.775 | 0.765 | 0.540 |
| 4 | 0.768 | 0.831 | 0.795 | 0.740 | 0.536 |
| 5 | 0.773 | 0.832 | 0.785 | 0.760 | 0.545 |
| 6 | 0.778 | 0.837 | 0.800 | 0.755 | 0.556 |
| 7 | 0.773 | 0.832 | 0.780 | 0.765 | 0.545 |
| 8 | 0.773 | 0.832 | 0.780 | 0.765 | 0.545 |
| 9 | 0.758 | 0.830 | 0.785 | 0.730 | 0.516 |
| 10 | 0.763 | 0.830 | 0.780 | 0.745 | 0.525 |
| Mean | 0.769 | 0.832 | 0.785 | 0.752 | 0.537 |
| SD | 0.006 | 0.002 | 0.008 | 0.013 | 0.011 |
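
The Mean and SD rows above can be recomputed from the ten trial rows. The sketch below does this for the ACC and MCC columns, copying the values from the table. Using the sample standard deviation (`statistics.stdev`) is an assumption, since the excerpt does not say which estimator was used; `summarize` is a hypothetical helper name.

```python
import statistics

# ACC and MCC values for trials 1 to 10, copied from the table above.
acc = [0.768, 0.765, 0.770, 0.768, 0.773, 0.778, 0.773, 0.773, 0.758, 0.763]
mcc = [0.535, 0.531, 0.540, 0.536, 0.545, 0.556, 0.545, 0.545, 0.516, 0.525]


def summarize(values):
    """Return (mean, sample standard deviation), both rounded to 3 decimals."""
    return round(statistics.mean(values), 3), round(statistics.stdev(values), 3)


print("ACC:", summarize(acc))
print("MCC:", summarize(mcc))
```

For these two columns the rounded results agree with the Mean and SD rows, and the population estimator (`statistics.pstdev`) rounds to the same values.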
Independent test classifying strong enhancers and weak enhancers under 10 trials
| Trial | ACC | AUC | SN | SP | MCC |
|---|---|---|---|---|---|
| 1 | 0.650 | 0.728 | 0.680 | 0.620 | 0.301 |
| 2 | 0.710 | 0.795 | 0.880 | 0.540 | 0.447 |
| 3 | 0.695 | 0.751 | 0.920 | 0.470 | 0.437 |
| 4 | 0.670 | 0.749 | 0.750 | 0.590 | 0.344 |
| 5 | 0.660 | 0.724 | 0.720 | 0.600 | 0.322 |
| 6 | 0.690 | 0.779 | 0.810 | 0.570 | 0.391 |
| 7 | 0.670 | 0.736 | 0.740 | 0.600 | 0.343 |
| 8 | 0.660 | 0.728 | 0.750 | 0.570 | 0.325 |
| 9 | 0.695 | 0.759 | 0.840 | 0.550 | 0.408 |
| 10 | 0.675 | 0.735 | 0.820 | 0.530 | 0.366 |
| Mean | 0.678 | 0.748 | 0.791 | 0.564 | 0.368 |
| SD | 0.019 | 0.024 | 0.076 | 0.044 | 0.050 |
Fig. 2 Variation in evaluation metrics across 10 trials of the independent test for (a) Layer 1: Enhancer Identification and (b) Layer 2: Enhancer Classification
Comparative analysis between results of the proposed method and other studies
| Task | Method | ACC | AUC | SN | SP | MCC | Source |
|---|---|---|---|---|---|---|---|
| Enhancer Identification | iEnhancer-2L | 0.730 | 0.806 | 0.710 | 0.750 | 0.460 | Liu et al., 2016 |
| Enhancer Identification | EnhancerPred | 0.740 | 0.801 | 0.735 | 0.745 | 0.480 | Jia and He, 2016 |
| Enhancer Identification | iEnhancer-EL | 0.748 | 0.817 | 0.710 | | 0.496 | Liu et al., 2018 |
| Enhancer Identification | iEnhancer-ECNN | **0.769** | **0.832** | **0.785** | 0.752 | **0.537** | This study |
| Enhancer Classification | iEnhancer-2L | 0.605 | 0.668 | 0.470 | | 0.218 | Liu et al., 2016 |
| Enhancer Classification | EnhancerPred | 0.550 | 0.579 | 0.450 | 0.650 | 0.102 | Jia and He, 2016 |
| Enhancer Classification | iEnhancer-EL | 0.610 | 0.680 | 0.540 | 0.680 | 0.222 | Liu et al., 2018 |
| Enhancer Classification | iEnhancer-ECNN | **0.678** | **0.748** | **0.791** | 0.564 | **0.368** | This study |
Values which are significantly higher than the others are in bold
Fig. 3 Overview of the model development
Data distribution of 5 parts in the development set for identifying enhancers and non-enhancers
| Part | Non-enhancers | Strong enhancers | Weak enhancers |
|---|---|---|---|
| 1 | 301 | 151 | 142 |
| 2 | 295 | 153 | 146 |
| 3 | 295 | 148 | 151 |
| 4 | 292 | 153 | 149 |
| 5 | 301 | 137 | 154 |
Data distribution of 5 parts in the development set for classifying strong enhancers and weak enhancers
| Part | Strong enhancers | Weak enhancers |
|---|---|---|
| 1 | 150 | 147 |
| 2 | 154 | 143 |
| 3 | 146 | 151 |
| 4 | 148 | 149 |
| 5 | 144 | 152 |
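
The part sizes in the table above (297, 297, 297, 297, 296) are what a shuffle followed by a near-equal five-way cut of 1484 sequences would give. The sketch below illustrates that idea. The procedure is an assumption, because this excerpt does not describe how the parts were drawn, and `split_into_parts` and the integer sample IDs are hypothetical.

```python
import random


def split_into_parts(items, n_parts=5, seed=0):
    """Shuffle items, then cut them into n_parts parts whose sizes differ by at most 1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    base, extra = divmod(len(shuffled), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # the first `extra` parts get one more item
        parts.append(shuffled[start:start + size])
        start += size
    return parts


# 1484 enhancers in total: the Strong and Weak columns above each sum to 742.
sample_ids = range(1484)
parts = split_into_parts(sample_ids)
print([len(part) for part in parts])
```

A plain shuffle does not force equal strong and weak counts within each part, which is consistent with the small per-part imbalances visible in the table.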
The corresponding code of each nucleic acid in one-hot encoding
| Nucleic Acid | Code |
|---|---|
| ‘A’ | [ 1 0 0 0 ] |
| ‘C’ | [ 0 1 0 0 ] |
| ‘G’ | [ 0 0 1 0 ] |
| ‘T’ | [ 0 0 0 1 ] |
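
The mapping in the table above can be sketched in a few lines of Python. The function name `one_hot_encode`, the upper-casing of the input, and the choice to raise an error on symbols other than A, C, G, and T are assumptions made for illustration, not details taken from the paper.

```python
# Codes copied from the table above; positions correspond to A, C, G, T.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(sequence):
    """Return one 4-element code per nucleotide (hypothetical helper)."""
    encoded = []
    for base in sequence.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        encoded.append(list(ONE_HOT[base]))  # copy so callers cannot mutate the table
    return encoded


for base, code in zip("ACGT", one_hot_encode("ACGT")):
    print(base, code)
```

A sequence of length n therefore becomes an n-by-4 matrix, a shape that a convolutional network can consume directly.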
Fig. 4 Architecture of the proposed CNN models