Thanh-Hoang Nguyen-Vo, Quang H Trinh, Loc Nguyen, Phuong-Uyen Nguyen-Hoang, Susanto Rahardja, Binh P Nguyen.
Abstract
BACKGROUND: Promoters, non-coding DNA sequences located upstream of the transcription start site of genes or gene clusters, are essential regulatory elements for the initiation and regulation of transcription. Identifying promoters in DNA sequences and genomes also contributes significantly to resolving the full structure of genes of interest, which makes the exploration of promoter regions one of the most important topics in molecular genetics and biology. Besides experimental techniques, computational methods have been developed to predict promoters. In this study, we propose iPromoter-Seqvec, an efficient computational model that predicts TATA and non-TATA promoters in human and mouse genomes using bidirectional long short-term memory neural networks in combination with sequence-embedded features extracted from the input sequences. The promoter and non-promoter sequences were retrieved from the Eukaryotic Promoter Database and then refined to create four benchmark datasets.
Keywords: Bidirectional long short-term memory; DNA; Promoter; TATA-box; Transcription start site
Year: 2022 PMID: 36192696 PMCID: PMC9531353 DOI: 10.1186/s12864-022-08829-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 4.547
Model performance on the independent test sets of iPromoter-Seqvec and other state-of-the-art methods
| Dataset | Method | AUCROC | AUCPR | BA | SN | SP | PR | MCC | F1 |
|---|---|---|---|---|---|---|---|---|---|
| HS-TApro | iPro-EP | 0.89 | 0.87 | 0.81 | 0.84 | 0.78 | 0.79 | 0.62 | 0.81 |
| HS-TApro | DeePromoter | - | - | 0.67 | 0.94 | 0.39 | 0.61 | 0.40 | 0.74 |
| HS-TApro | iPromoter-Seqvec (Ours) | 0.99 | 0.99 | 0.94 | 0.90 | 0.99 | 0.99 | 0.89 | 0.94 |
| HS-nonTApro | iPro-EP | 0.73 | 0.74 | 0.65 | 0.73 | 0.56 | 0.63 | 0.30 | 0.67 |
| HS-nonTApro | DeePromoter | - | - | 0.51 | 0.90 | 0.12 | 0.51 | 0.04 | 0.65 |
| HS-nonTApro | iPromoter-Seqvec (Ours) | 0.86 | 0.86 | 0.75 | 0.62 | 0.89 | 0.85 | 0.53 | 0.72 |
| MM-TApro | DeePromoter | - | - | 0.59 | 0.84 | 0.34 | 0.56 | 0.21 | 0.67 |
| MM-TApro | iPromoter-Seqvec (Ours) | 0.99 | 0.99 | 0.93 | 0.88 | 0.98 | 0.97 | 0.86 | 0.92 |
| MM-nonTApro | DeePromoter | - | - | 0.64 | 0.87 | 0.40 | 0.59 | 0.31 | 0.71 |
| MM-nonTApro | iPromoter-Seqvec (Ours) | 0.91 | 0.91 | 0.83 | 0.74 | 0.91 | 0.90 | 0.67 | 0.81 |
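The threshold-dependent columns of the performance table can all be derived from a binary confusion matrix. The sketch below assumes that BA, SN, SP and PR stand for balanced accuracy, sensitivity, specificity and precision, which is the usual reading of these abbreviations but is not spelled out in this record. The function name `binary_metrics` and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
from math import sqrt


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    Assumed meaning of the table abbreviations:
    SN = sensitivity (recall), SP = specificity, PR = precision,
    BA = balanced accuracy, MCC = Matthews correlation coefficient.
    """
    sn = _ratio(tp, tp + fn)
    sp = _ratio(tn, tn + fp)
    pr = _ratio(tp, tp + fp)
    ba = (sn + sp) / 2
    f1 = _ratio(2 * tp, 2 * tp + fp + fn)
    mcc = _ratio(
        tp * tn - fp * fn,
        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"BA": ba, "SN": sn, "SP": sp, "PR": pr, "MCC": mcc, "F1": f1}


# Hypothetical counts for a balanced 500-sequence test set (250 + 250).
metrics = binary_metrics(tp=225, fp=3, tn=247, fn=25)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

AUCROC and AUCPR are not covered here because they need the ranked prediction scores rather than a single confusion matrix.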
Fig. 1 Steps in developing iPromoter-Seqvec
Datasets used for model training and evaluation (number of sequences; promoters : non-promoters = 1:1)
| Dataset | Training | Validation | Test | Total |
|---|---|---|---|---|
| HS-TATApro | 4958 | 400 | 500 | 5858 |
| HS-nonTATApro | 42800 | 4000 | 5000 | 51800 |
| MM-TATApro | 5272 | 400 | 500 | 6172 |
| MM-nonTATApro | 33892 | 4000 | 5000 | 42892 |
Fig. 2 Construction of non-promoters (used in model training only) based on their corresponding promoters
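The caption does not spell out how a non-promoter is derived from a promoter. One scheme described in the DeePromoter work, which may or may not match the procedure used here, splits each promoter into 20 segments, replaces 12 randomly chosen segments with random nucleotides, and keeps the remaining 8 unchanged. The sketch below illustrates that scheme only; the function name `make_non_promoter` and its default parameters are assumptions for illustration.

```python
import random


def make_non_promoter(promoter, n_parts=20, n_replaced=12, rng=None):
    """Build a synthetic negative from a promoter (assumed DeePromoter-style scheme).

    The sequence is cut into `n_parts` contiguous segments; `n_replaced`
    randomly chosen segments are overwritten with random nucleotides and
    the remaining segments are kept as they are.
    """
    rng = rng or random.Random()
    size, rem = divmod(len(promoter), n_parts)

    # Cut the promoter into segments; the first `rem` segments get one extra base.
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)
        chunks.append(promoter[start:end])
        start = end

    # Overwrite the selected segments with random bases of the same length.
    for i in rng.sample(range(n_parts), n_replaced):
        chunks[i] = "".join(rng.choice("ACGT") for _ in chunks[i])

    return "".join(chunks)


# Usage with a toy sequence and a fixed seed for reproducibility.
toy_promoter = "ACGT" * 75
negative = make_non_promoter(toy_promoter, rng=random.Random(0))
print(len(toy_promoter), len(negative))
```

Keeping some segments intact means the negatives still resemble real promoters locally, which makes the classification task harder than it would be with fully random sequences.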
Fig. 3 Conversion of sequences to index vectors
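As a rough illustration of what converting a sequence to an index vector can look like, the sketch below maps each nucleotide to an integer so that an embedding layer can look it up. The vocabulary (`A`, `C`, `G`, `T` mapped to 1 to 4, with 0 reserved for padding and unknown bases), the per-nucleotide tokenization, and the function name `to_index_vector` are all assumptions; the record does not state the exact encoding used by iPromoter-Seqvec.

```python
# Assumed vocabulary: one index per nucleotide, 0 reserved for padding/unknown.
NUC_TO_INDEX = {"A": 1, "C": 2, "G": 3, "T": 4}
PAD = 0


def to_index_vector(seq, length=None):
    """Map a DNA string to a list of integer indices.

    Bases outside A/C/G/T (for example N) are mapped to PAD.
    If `length` is given, the vector is truncated or right-padded to it.
    """
    indices = [NUC_TO_INDEX.get(base, PAD) for base in seq.upper()]
    if length is not None:
        indices = indices[:length] + [PAD] * (length - len(indices))
    return indices


print(to_index_vector("acgtn"))        # [1, 2, 3, 4, 0]
print(to_index_vector("AC", length=4)) # [1, 2, 0, 0]
```

Index vectors of equal length can then be stacked into a batch and passed to an embedding layer ahead of the recurrent layers.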
Fig. 4 Model architecture