Runtao Yang, Feng Wu, Chengjin Zhang, Lina Zhang.
Abstract
As critical components of DNA, enhancers can efficiently and specifically manipulate the spatial and temporal regulation of gene transcription. Malfunction or dysregulation of enhancers is implicated in a wide range of human pathologies. Therefore, identifying enhancers and their strength may provide insights into the molecular mechanisms of gene transcription and facilitate the discovery of candidate drug targets. In this paper, a new predictor of enhancers and their strength, iEnhancer-GAN, is proposed based on a deep learning framework that combines word embedding with a sequence generative adversarial net (Seq-GAN). Because the training dataset is relatively small, the Seq-GAN is designed to generate artificial sequences. Given that each functional element in a DNA sequence is analogous to a "word" in linguistics, word segmentation methods are proposed to divide DNA sequences into "words", and the skip-gram model is employed to transform the "words" into numerical vectors. Because convolutional neural networks (CNNs) are effective at extracting high-level abstract features, a CNN architecture is constructed to perform the identification tasks, and the word vectors of each DNA sequence are vertically concatenated to form the embedding matrix used as the CNN input. Experimental results demonstrate the effectiveness of the Seq-GAN in expanding the training dataset, the possibility of applying word segmentation methods to extract "words" from DNA sequences, the feasibility of using the skip-gram model to encode DNA sequences, and the strong predictive ability of the CNN. Compared with other state-of-the-art methods on the training dataset and the independent test dataset, the proposed method achieves a significantly improved overall performance. It is anticipated that the proposed method will help advance enhancer-related research.
Keywords: convolutional neural network; enhancer; sequence generative adversarial net; word embedding
Year: 2021; PMID: 33808317; PMCID: PMC8036415; DOI: 10.3390/ijms22073589
Source DB: PubMed; Journal: Int J Mol Sci; ISSN: 1422-0067; Impact factor: 5.923
Figure 1. The sub-processes of DNA transcription.
Figure 2. Gene regulations of enhancers.
Figure 3. The flowchart of the proposed method iEnhancer-GAN.
Performance comparisons of different word segmentation methods without Seq-GAN on the training dataset.
| Layer | Word Segmentation Method | | | | |
|---|---|---|---|---|---|
| First Layer | Overlapped 3-Gram | | | 0.786 | |
| | Non-Overlapped 3-Gram | 0.773 | 0.777 | 0.769 | 0.546 |
| | Word Segmentation | 0.784 | 0.765 | | 0.569 |
| Second Layer | Overlapped 3-Gram | | 0.714 | | 0.350 |
| | Non-Overlapped 3-Gram | 0.659 | 0.715 | 0.602 | 0.320 |
| | Word Segmentation | | | 0.613 | |
Performance comparisons of different word segmentation methods without Seq-GAN on the independent test dataset.
| Layer | Word Segmentation Method | | | | |
|---|---|---|---|---|---|
| First Layer | Overlapped 3-Gram | 0.752 | 0.781 | 0.724 | 0.539 |
| | Non-Overlapped 3-Gram | 0.762 | 0.784 | 0.741 | 0.552 |
| | Word Segmentation | | | | |
| Second Layer | Overlapped 3-Gram | 0.718 | 0.843 | | 0.484 |
| | Non-Overlapped 3-Gram | | 0.896 | 0.560 | 0.523 |
| | Word Segmentation | 0.724 | | 0.531 | |
Figure 4. Comparisons between the actual DNA sequences and the generated DNA sequences on nucleotide compositions and mean values of some physicochemical properties. (a) The overall frequencies of the nucleotides for the actual DNA sequences and the generated DNA sequences. (b) Mean values of some physicochemical properties for the actual DNA sequences and the generated DNA sequences.
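The nucleotide-composition comparison in panel (a) can be illustrated with a minimal sketch. The function name `nucleotide_frequencies` and the toy sequences are hypothetical and are not taken from the paper; the sketch only shows how overall A/C/G/T frequencies might be computed for two sets of sequences before comparing them.

```python
from collections import Counter


def nucleotide_frequencies(sequences):
    """Return the overall frequency of A, C, G and T across a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        # Ignore any character that is not one of the four standard bases.
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


# Hypothetical toy sequences, not data from the paper.
actual = ["ACGTAC", "GGTTAA"]
generated = ["ACGTTT", "GGCCAA"]

actual_freq = nucleotide_frequencies(actual)
generated_freq = nucleotide_frequencies(generated)

# Per-base absolute difference between the two sets.
difference = {b: abs(actual_freq[b] - generated_freq[b]) for b in "ACGT"}
```

Small per-base differences would suggest that the generated sequences resemble the actual ones in composition, although composition alone does not establish that they are biologically realistic.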
Prediction results with and without Seq-GAN on the training dataset.
| Layer | Method | | | | |
|---|---|---|---|---|---|
| First Layer (Enhancer Identification) | Without Seq-GAN | 0.784 | 0.765 | 0.803 | 0.569 |
| | With Seq-GAN | | | | |
| Second Layer (Enhancer Strength Identification) | Without Seq-GAN | 0.675 | 0.737 | 0.613 | 0.353 |
| | With Seq-GAN | | | | |
Prediction results with and without Seq-GAN on the independent test dataset.
| Layer | Method | | | | |
|---|---|---|---|---|---|
| First Layer (Enhancer Identification) | Without Seq-GAN | 0.772 | 0.799 | 0.746 | |
| | With Seq-GAN | | | | 0.567 |
| Second Layer (Enhancer Strength Identification) | Without Seq-GAN | 0.724 | 0.917 | 0.531 | |
| | With Seq-GAN | | | | 0.505 |
The prediction results compared with those of other methods on the training dataset.
| Layer | Method | | | | |
|---|---|---|---|---|---|
| First Layer | iEnhancer-2L | 0.769 | 0.781 | 0.759 | 0.540 |
| | iEnhancer-PsedeKNC | 0.768 | 0.773 | 0.763 | 0.540 |
| | EnhancerPred | 0.732 | 0.726 | 0.738 | 0.464 |
| | iEnhancer-EL | 0.780 | 0.757 | 0.804 | 0.561 |
| | iEnhancer-5Step | 0.823 | 0.811 | 0.835 | 0.650 |
| | iEnhancer-ECNN | 0.769 | 0.785 | 0.752 | 0.537 |
| | iEnhancer-CNN | 0.806 | 0.759 | 0.889 | 0.693 |
| | iEnhancer-XG | 0.811 | 0.757 | 0.865 | 0.627 |
| | iEnhancer-GAN [This Study] | | | | |
| Second Layer | iEnhancer-2L | 0.619 | 0.622 | 0.618 | 0.240 |
| | iEnhancer-PsedeKNC | 0.634 | 0.626 | 0.644 | 0.270 |
| | EnhancerPred | 0.621 | 0.627 | 0.615 | 0.241 |
| | iEnhancer-EL | 0.650 | 0.690 | 0.611 | 0.315 |
| | iEnhancer-5Step | 0.681 | 0.753 | 0.608 | 0.370 |
| | iEnhancer-ECNN | 0.678 | 0.791 | 0.564 | 0.368 |
| | iEnhancer-CNN | 0.764 | 0.436 | 0.768 | 0.451 |
| | iEnhancer-XG | 0.667 | 0.749 | 0.586 | 0.340 |
| | iEnhancer-GAN [This Study] | | | | |
The prediction results compared with those of other methods on the independent test dataset.
| Layer | Method | | | | |
|---|---|---|---|---|---|
| First Layer | iEnhancer-2L | 0.730 | 0.750 | 0.710 | 0.460 |
| | EnhancerPred | 0.740 | 0.735 | 0.745 | 0.480 |
| | iEnhancer-EL | 0.748 | 0.710 | 0.785 | 0.496 |
| | iEnhancer-5Step | | | 0.760 | 0.580 |
| | iEnhancer-CNN | 0.775 | 0.783 | | |
| | iEnhancer-XG | 0.667 | 0.749 | 0.586 | 0.340 |
| | iEnhancer-GAN [This Study] | 0.784 | 0.811 | 0.758 | 0.567 |
| Second Layer | iEnhancer-2L | 0.605 | 0.470 | | 0.218 |
| | EnhancerPred | 0.550 | 0.450 | 0.650 | 0.102 |
| | iEnhancer-EL | 0.610 | 0.540 | 0.680 | 0.222 |
| | iEnhancer-5Step | 0.635 | 0.740 | 0.530 | 0.280 |
| | iEnhancer-CNN | 0.750 | 0.653 | 0.761 | 0.323 |
| | iEnhancer-XG | 0.667 | 0.749 | 0.586 | 0.340 |
| | iEnhancer-GAN [This Study] | | | 0.537 | |
Figure 5. 3-gram word segmentation. Each DNA subsequence in brackets represents a “word”. (a) Overlapped 3-gram word segmentation. (b) Non-overlapped 3-gram word segmentation.
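The two segmentation schemes named in this caption can be sketched in a few lines. The function names are hypothetical, and dropping a trailing fragment shorter than three nucleotides in the non-overlapped case is an assumption of this sketch, not something the source specifies.

```python
def overlapped_kmers(seq, k=3):
    """Slide a window of width k along the sequence one nucleotide at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapped_kmers(seq, k=3):
    """Cut the sequence into consecutive, disjoint blocks of width k.

    Assumption: a trailing fragment shorter than k is dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


seq = "ACGTACGT"  # toy sequence, not from the paper
overlapped = overlapped_kmers(seq)
# -> ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']
non_overlapped = non_overlapped_kmers(seq)
# -> ['ACG', 'TAC']  (the trailing 'GT' is dropped)
```

A sequence of length n yields n - 2 overlapped 3-grams but only about n / 3 non-overlapped ones, so the overlapped scheme gives the downstream embedding model more, but highly redundant, "words".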
Figure 6. The skip-gram model based on negative sampling.
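As a rough illustration of the training data a skip-gram model with negative sampling consumes, the sketch below builds (center, context) pairs from a list of DNA "words" and draws noise words for one pair. All names are hypothetical. Drawing negatives uniformly is a simplification: word2vec-style implementations usually sample from a smoothed unigram distribution, and the source does not state which variant was used.

```python
import random


def skipgram_pairs(words, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


def draw_negatives(vocab, exclude, n, rng):
    """Draw n noise words uniformly from vocab, skipping words in `exclude`.

    Assumption: uniform sampling, chosen here only for simplicity.
    """
    candidates = [w for w in vocab if w not in exclude]
    if not candidates:
        return []
    return [rng.choice(candidates) for _ in range(n)]


words = ["ACG", "CGT", "GTA", "TAC"]  # toy "sentence" of 3-grams
pairs = skipgram_pairs(words, window=1)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = sorted(set(words))
center, context = pairs[0]
negatives = draw_negatives(vocab, {center, context}, n=2, rng=rng)
```

During training, the model would be pushed to score each true (center, context) pair highly and each (center, negative) pair poorly, which avoids computing a softmax over the whole vocabulary.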
Figure 7. The architecture of the proposed convolutional neural network (CNN).
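The abstract describes the CNN input as the word vectors of a DNA sequence concatenated vertically into an embedding matrix. A minimal sketch of that stacking step follows. The function name, the zero-padding to a fixed number of rows, and the zero vector for unknown words are assumptions made for illustration; the source does not describe how variable lengths or unseen words are handled.

```python
def embedding_matrix(words, word_vectors, max_len, dim):
    """Stack one vector per word, row by row, into a max_len x dim matrix.

    Assumptions: sequences longer than max_len are truncated, shorter ones
    are zero-padded, and words missing from word_vectors map to zeros.
    """
    zero = [0.0] * dim
    rows = [list(word_vectors.get(w, zero)) for w in words[:max_len]]
    padding = max_len - len(rows)
    rows.extend(list(zero) for _ in range(padding))
    return rows


# Hypothetical 2-dimensional vectors; real embeddings would be learned.
toy_vectors = {"ACG": [0.1, 0.2], "CGT": [0.3, 0.4]}
matrix = embedding_matrix(["ACG", "CGT", "GTA"], toy_vectors, max_len=4, dim=2)
```

Each row corresponds to one "word" in sequence order, so a convolution that spans several adjacent rows sees a short run of neighboring words.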
Figure 8. The architecture of the sequence generative adversarial net.