Mhaned Oubounyt, Zakaria Louadi, Hilal Tayara, Kil To Chong.
Abstract
The promoter region is located near the transcription start sites and regulates transcription initiation of the gene by controlling the binding of RNA polymerase. Promoter region recognition is therefore an important area of interest in bioinformatics. Numerous tools for promoter prediction have been proposed; however, their reliability still needs to be improved. In this work, we propose a robust deep learning model, called DeePromoter, to analyze the characteristics of short eukaryotic promoter sequences and accurately recognize human and mouse promoter sequences. DeePromoter combines a convolutional neural network (CNN) and a long short-term memory (LSTM) network. Additionally, instead of using non-promoter regions of the genome as a negative set, we derive a more challenging negative set from the promoter sequences. The proposed negative set reconstruction method improves the discrimination ability and significantly reduces the number of false positive predictions. Consequently, DeePromoter outperforms the previously proposed promoter prediction tools. In addition, a web server for promoter prediction was developed based on the proposed methods and is available at https://home.jbnu.ac.kr/NSCL/deepromoter.htm.
Keywords: DeePromoter; bioinformatics; convolutional neural network; deep learning; promoter
Year: 2019 PMID: 31024615 PMCID: PMC6460014 DOI: 10.3389/fgene.2019.00286
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Statistics of the four datasets used in this study.
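| Dataset | Promoter sequences | Non-promoter sequences | Length (bp) | Location relative to TSS |
| Human-TATA | 3,065 | 3,065 | 300 | –249 to +50 |
| Human-non-TATA | 26,532 | 26,532 | 300 | –249 to +50 |
| Mouse-TATA | 3,305 | 3,305 | 300 | –249 to +50 |
| Mouse-non-TATA | 21,804 | 21,804 | 300 | –249 to +50 |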
Figure 1. Illustration of the negative set construction method. Green represents the randomly conserved subsequences, while red represents the randomly chosen and substituted ones.
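The construction shown in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `make_negative`, the split into 20 equal parts, and the choice to substitute 12 of them are assumptions made here for the example, since this record does not state the exact parameters.

```python
import random


def make_negative(promoter, n_parts=20, n_substituted=12, rng=None):
    """Derive a negative sample from a promoter sequence.

    The sequence is cut into n_parts equal subsequences. A random subset of
    n_substituted parts is replaced by random nucleotides (red in Figure 1);
    the remaining parts are kept unchanged (green in Figure 1).
    n_parts and n_substituted are illustrative assumptions.
    """
    rng = rng or random.Random()
    if len(promoter) % n_parts != 0:
        raise ValueError("sequence length must be divisible by n_parts")
    part_len = len(promoter) // n_parts
    parts = [promoter[i:i + part_len] for i in range(0, len(promoter), part_len)]
    # Choose which parts to overwrite; all other parts stay conserved.
    for idx in rng.sample(range(n_parts), n_substituted):
        parts[idx] = "".join(rng.choice("ACGT") for _ in range(part_len))
    return "".join(parts)


# Usage with a placeholder 300-bp sequence (not real promoter data).
positive = "ACGT" * 75
negative = make_negative(positive, rng=random.Random(0))
```

Because the negative sample keeps part of the original promoter, a classifier cannot separate the two classes from coarse composition alone, which is what makes this negative set more challenging than unrelated genomic regions.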
Figure 2. Sequence logos of the human TATA promoter for the positive set (A) and the negative set (B). The plots show the conservation of the functional motifs between the two sets.
Figure 3. Sequence logos of the mouse TATA promoter for the positive set (A) and the negative set (B). The plots show the conservation of the functional motifs between the two sets.
Figure 4. The architecture of the proposed DeePromoter model.
Figure 5. The effect of different conservation ratios of the TATA motif in the negative set on performance for the TATA promoter datasets of human (A) and mouse (B).
Comparison of DeePromoter with the state-of-the-art method.
| Dataset | Method | Precision | Recall | MCC |
| Human TATA | DeePromoter | | | |
| Human TATA | CNNProm | 0.75 | 0.91 | 0.62 |
| Human non-TATA | DeePromoter | | | |
| Human non-TATA | CNNProm | 0.58 | 0.83 | 0.26 |
| Mouse TATA | DeePromoter | | 0.95 | |
| Mouse TATA | CNNProm | 0.68 | | 0.56 |
| Mouse non-TATA | DeePromoter | | | |
| Mouse non-TATA | CNNProm | 0.54 | 0.86 | 0.17 |
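Assuming the three numeric columns are precision, recall, and the Matthews correlation coefficient (MCC), in that order, the sketch below shows how they relate to confusion-matrix counts. The helper names are hypothetical, and `mcc_from_precision_recall` additionally assumes equally sized positive and negative sets, as in the dataset statistics above. Under those assumptions, the CNNProm row for Human TATA is internally consistent.

```python
import math


def precision_recall_mcc(tp, fp, tn, fn):
    """Return (precision, recall, MCC) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


def mcc_from_precision_recall(precision, recall, n_pos=1.0, n_neg=1.0):
    """Recover MCC from precision and recall for given class sizes.

    The default n_pos == n_neg is an assumption (balanced evaluation set).
    """
    tp = recall * n_pos
    fn = n_pos - tp
    fp = tp * (1 - precision) / precision  # follows from precision = tp / (tp + fp)
    tn = n_neg - fp
    return precision_recall_mcc(tp, fp, tn, fn)[2]


# CNNProm on Human TATA: precision 0.75, recall 0.91
print(round(mcc_from_precision_recall(0.75, 0.91), 2))  # 0.62
```

Since the published values are rounded to two decimals, recomputing other rows this way may differ from the table in the last digit.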
Figure 6. The saliency map of the region from –40 bp to +10 bp, which includes the TATA-box, for human TATA promoter sequences.
Figure 7. The saliency map of the region from –40 bp to +10 bp, which includes the TATA-box, for mouse TATA promoter sequences.