Linan Cao, Pei Liu, Jialong Chen, Lei Deng.
Abstract
In the regulation of gene expression and in processes such as DNA replication and mRNA transcription, the binding of transcription factors (TFs) to TF binding sites (TFBSs) plays a vital role. Precisely modeling binding specificity and searching for TFBSs help to elucidate the mechanisms of gene expression. In recent years, computational and deep learning methods for identifying TFBSs have become an active field of research. However, existing methods generally cannot achieve high performance and interpretability simultaneously. Here, we develop an accurate and interpretable attention-based hybrid approach, DeepARC, that combines a convolutional neural network (CNN) and a recurrent neural network (RNN) to predict TFBSs. DeepARC employs a positional embedding method to extract hidden representations from DNA sequences, combining the positional information of OneHot encoding with the distributed representations of DNA2Vec. DeepARC feeds the positional embedding of a DNA sequence into a CNN-BiLSTM-Attention framework to accomplish the task of finding motifs. By taking advantage of the attention mechanism, DeepARC gains greater access to the informative parts of the sequence and brings interpretability to motif discovery through attention weight graphs. Moreover, DeepARC achieves promising performance, with an average area under the receiver operating characteristic curve (AUC) of 0.908 across five cell lines (A549, GM12878, Hep-G2, H1-hESC, and HeLa) on the benchmark dataset. We also compare the positional embedding with OneHot and DNA2Vec alone and show a competitive advantage.
Keywords: DNA; attention mechanism; deep learning; positional embedding; transcription factor binding sites
Year: 2022 PMID: 35719916 PMCID: PMC9204005 DOI: 10.3389/fonc.2022.893520
Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 5.738
Figure 1. The architecture of DeepARC. In the embedding layer, OneHot and k-mer encoding are used to generate position-based feature embeddings from DNA sequences. Convolution kernels are then utilized to extract non-linear features. In the BiLSTM layer, we use a bidirectional long short-term memory network (BiLSTM) to capture the contextual dependencies of DNA sequences. Next, we use the attention mechanism to enhance the model’s prediction performance, and finally, the prediction results are obtained through the dense layer.
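The exact scoring function of the attention layer is not specified in this excerpt; the sketch below assumes an additive (Bahdanau-style) attention pooling over the BiLSTM outputs, with hypothetical weight parameters `w`, `b`, and `u`. The per-position weights it produces are the kind of values an attention heatmap (cf. Figures 4, 5) visualizes.

```python
import numpy as np

def attention_pool(hidden_states, w, b, u):
    """Additive attention pooling over per-position BiLSTM outputs.

    hidden_states: (seq_len, hidden_dim) array of BiLSTM outputs.
    w, b, u: projection parameters (learned in practice; random here).
    Returns the pooled context vector and the per-position attention
    weights, which sum to 1 and can be plotted as a heatmap.
    """
    scores = np.tanh(hidden_states @ w + b) @ u      # (seq_len,)
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over positions
    context = weights @ hidden_states                # (hidden_dim,)
    return context, weights

# Toy example: 99 positions (a 101-bp sequence split into 3-mers),
# 32-dim BiLSTM output, attention vector size 16 as in the settings table.
rng = np.random.default_rng(0)
h = rng.normal(size=(99, 32))
w = rng.normal(size=(32, 16))
b = rng.normal(size=16)
u = rng.normal(size=16)
context, weights = attention_pool(h, w, b, u)
```

Positions with large weights are the ones the model attends to most, which is what makes the learned motif locations inspectable.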
Figure 2. DNA positional embedding for an original DNA sequence of 101 bp split into 3-mers.
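As a rough illustration of the k-mer splitting and OneHot parts of the positional embedding: the sketch below assumes stride-1 overlapping 3-mers (so a 101-bp sequence yields 99 tokens) and a simple concatenation of the OneHot vector with a DNA2Vec vector. The `dna2vec` lookup is a placeholder for the pretrained embeddings, and the concatenation layout is our assumption, not the paper's stated implementation.

```python
def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def onehot(kmer):
    """Flat OneHot encoding of a k-mer: 4 bits per base, A/C/G/T order."""
    base_idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    vec = [0] * (4 * len(kmer))
    for i, base in enumerate(kmer):
        vec[4 * i + base_idx[base]] = 1
    return vec

def positional_embedding(seq, dna2vec, k=3):
    """Per-k-mer concatenation of OneHot and DNA2Vec vectors (hypothetical)."""
    return [onehot(km) + list(dna2vec[km]) for km in kmers(seq, k)]
```

With 100-dimensional DNA2Vec vectors, each of the 99 tokens of a 101-bp input becomes a 112-dimensional vector (12 OneHot bits plus 100 distributed dimensions).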
Performance comparison of CNN-BiLSTM, BiLSTM-Attention, and CNN-BiLSTM-Att with OneHot embedding.
| Dataset | Model | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| A549 | CNN-BiLSTM | 79.42 | 82.61 | 81.01 | 0.625 | 0.896 |
| | BiLSTM-Att | 74.13 | – | 80.55 | 0.616 | 0.887 |
| | CNN-BiLSTM-Att | **80.66** | 83.66 | **82.16** | **0.644** | **0.901** |
| GM12878 | CNN-BiLSTM | 75.63 | – | 81.07 | 0.625 | 0.891 |
| | BiLSTM-Att | 73.90 | 85.58 | 79.74 | 0.600 | 0.880 |
| | CNN-BiLSTM-Att | **81.05** | 83.02 | **82.04** | **0.641** | **0.902** |
| HeLa | CNN-BiLSTM | – | 73.81 | 76.99 | 0.543 | 0.858 |
| | BiLSTM-Att | 76.32 | – | 76.13 | 0.524 | 0.845 |
| | CNN-BiLSTM-Att | 79.95 | 74.32 | **77.13** | **0.545** | **0.860** |
| Hep-G2 | CNN-BiLSTM | – | 77.99 | 81.93 | 0.641 | 0.906 |
| | BiLSTM-Att | 76.78 | – | 81.48 | 0.634 | 0.897 |
| | CNN-BiLSTM-Att | 81.47 | 84.50 | **82.98** | **0.660** | **0.908** |
| H1-hESC | CNN-BiLSTM | – | 81.31 | 81.25 | 0.629 | 0.883 |
| | BiLSTM-Att | 76.11 | 81.52 | 82.32 | 0.612 | 0.876 |
| | CNN-BiLSTM-Att | 81.26 | **82.13** | **82.72** | **0.636** | **0.891** |
CNN, convolutional neural network; BiLSTM, bidirectional long short-term memory network; Sen, sensitivity; Spe, specificity; Acc, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.
Bold indicates the best value for each metric within each dataset; –, value missing in the source record.
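The threshold-based metrics reported in these tables can all be computed from confusion-matrix counts; a minimal sketch, with Sen, Spe, and Acc expressed as percentages and MCC the Matthews correlation coefficient:

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy (percent) and MCC
    from true/false positive/negative counts."""
    sen = 100 * tp / (tp + fn)                  # sensitivity (recall)
    spe = 100 * tn / (tn + fp)                  # specificity
    acc = 100 * (tp + tn) / (tp + fp + tn + fn) # accuracy
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sen, spe, acc, mcc
```

AUC, by contrast, is threshold-free: it integrates the ROC curve over all decision thresholds, which is why it is reported separately.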
Parameter settings of the different models.
| Parameter | CNN-BiLSTM-Att | CNN-BiLSTM | BiLSTM-Att |
|---|---|---|---|
| Learning rate | 0.001 | 0.001 | 0.001 |
| Epochs | 20 | 20 | 20 |
| Batch size | 64 | 64 | 64 |
| CNN layers | 2 | 2 | – |
| Kernel size | 5 | 5 | – |
| BiLSTM hidden size | 16 | 16 | 32 |
| Attention vec size | 16 | – | 32 |
| Dense neurons | 16 | 32 | 32 |
| Dropout | 0.2 | 0.2 | 0.2 |
| Optimizer | Adam | Adam | Adam |
CNN, convolutional neural network; BiLSTM, bidirectional long short-term memory network.
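For reference, the settings table translates directly into a plain configuration mapping; the key names below are ours, and `None` stands in for the table's "–" entries (components a model does not have):

```python
# Hyperparameters from the settings table; None marks absent components.
CONFIGS = {
    "CNN-BiLSTM-Att": dict(learning_rate=0.001, epochs=20, batch_size=64,
                           cnn_layers=2, kernel_size=5, bilstm_hidden=16,
                           attention_vec=16, dense_neurons=16, dropout=0.2,
                           optimizer="Adam"),
    "CNN-BiLSTM":     dict(learning_rate=0.001, epochs=20, batch_size=64,
                           cnn_layers=2, kernel_size=5, bilstm_hidden=16,
                           attention_vec=None, dense_neurons=32, dropout=0.2,
                           optimizer="Adam"),
    "BiLSTM-Att":     dict(learning_rate=0.001, epochs=20, batch_size=64,
                           cnn_layers=None, kernel_size=None, bilstm_hidden=32,
                           attention_vec=32, dense_neurons=32, dropout=0.2,
                           optimizer="Adam"),
}
```

Note that the ablated models compensate for the removed component: dropping the CNN doubles the BiLSTM hidden size and attention vector size, and dropping attention doubles the dense layer.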
Performance comparison of OneHot, DNA2Vec, and positional embedding.
| Dataset | Model | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| A549 | OneHot | 80.66 | 83.66 | 82.16 | 0.644 | 0.901 |
| | DNA2Vec | 77.58 | – | 81.61 | 0.635 | 0.896 |
| | Positional embedding | – | 83.74 | – | – | – |
| GM12878 | OneHot | 81.05 | 83.02 | 82.04 | 0.641 | 0.902 |
| | DNA2Vec | 80.81 | 81.25 | 81.03 | 0.623 | 0.895 |
| | Positional embedding | – | – | – | – | – |
| HeLa | OneHot | 79.95 | 74.32 | 77.13 | 0.545 | 0.860 |
| | DNA2Vec | 77.53 | 75.48 | 76.51 | 0.534 | 0.853 |
| | Positional embedding | – | – | – | – | – |
| Hep-G2 | OneHot | 81.47 | 84.50 | 82.98 | 0.660 | 0.908 |
| | DNA2Vec | 80.91 | 85.14 | 83.25 | 0.661 | 0.908 |
| | Positional embedding | – | – | – | – | – |
| H1-hESC | OneHot | 81.26 | 82.13 | 82.72 | 0.636 | 0.891 |
| | DNA2Vec | 80.19 | 81.31 | – | 0.639 | 0.896 |
| | Positional embedding | – | – | 82.79 | – | – |
Sen, sensitivity; Spe, specificity; Acc, accuracy; MCC, Matthews correlation coefficient; AUC, area under the receiver operating characteristic curve.
Bold indicates the best value for each metric within each dataset; –, value missing in the source record.
Performance comparison of DeepARC and three existing predictors.
| Model | Sen (%) | Spe (%) | Acc (%) | MCC |
|---|---|---|---|---|
| DeepARC | – | – | – | – |
| DeepTF | 77.44 | 81.36 | 80.98 | 0.632 |
| CNN-Zeng | 72.12 | 81.96 | 79.92 | 0.619 |
| DeepBind | 72.64 | 81.44 | 79.82 | 0.609 |
Sen, sensitivity; Spe, specificity; Acc, accuracy; MCC, Matthews correlation coefficient.
Bold indicates the best-performing value for each metric; –, value missing in the source record.
Figure 3. Performance of DeepARC and three existing predictors in ROC-AUC. ROC, receiver operating characteristic; AUC, area under the ROC curve.
Figure 4. Heatmap of H1-hESC.
Figure 5. Heatmap of A549.