Sungjoon Park, Yookyung Koh, Hwisang Jeon, Hyunjae Kim, Yoonsun Yeo, Jaewoo Kang.
Abstract
Transcription factors (TFs) regulate the expression of their target genes by binding to the regulatory sequences of those genes (e.g., promoters and enhancers). To fully understand gene regulatory mechanisms, it is crucial to decipher the relationships between TFs and DNA sequences. Moreover, studies such as GWAS and eQTL analyses have verified that most disease-related variants lie in non-coding regions and have highlighted the need to identify the variants that cause disease by disrupting TF binding mechanisms. Doing so requires a prediction model that precisely predicts the binding relationships between TFs and DNA sequences. Recently, deep learning-based models have been proposed and have shown competitive results on the transcription factor binding site prediction task. However, the predictions of these models are difficult to interpret. In addition, previous models treat every region of the input DNA sequence as equally important for predicting TF binding, even though regions that carry TF-binding-associated signals, such as TF-binding motifs, should be weighted more heavily than others. To address these challenges, we propose TBiNet, an attention-based, interpretable deep neural network for predicting transcription factor binding sites. Using the attention mechanism, our method assigns more importance to the actual TF binding sites in the input DNA sequence. TBiNet quantitatively outperforms the current state-of-the-art methods (DeepSea and DanQ) on the TF-DNA binding prediction task. Moreover, TBiNet is more effective than the previous models in discovering known TF-binding motifs.
Year: 2020 PMID: 32770026 PMCID: PMC7414127 DOI: 10.1038/s41598-020-70218-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Statistics of the ChIP-seq dataset.
| Dataset | # Samples | Ratio (%) |
|---|---|---|
| Training | 4,400,000 | 90.48 |
| Validation | 8,000 | 0.16 |
| Test | 455,024 | 9.36 |
| Total | 4,863,024 | 100.00 |
Notations of variables.
| Notation | Description | Value |
|---|---|---|
| $n$ | Sequence length | 1,000 |
| $l$ | Kernel length | 26 |
| $d$ | Total number of kernels | 320 |
| $m$ | Maxpool window size | 13 |
| $t$ | Maxpool output size | 75 |
| $S$ | One-hot encoded input sequence | – |
| $W$ | Filter in the CNN layer | – |
| $X$ | Output of the CNN layer (activation map) | – |
| $k$ | Key vector | – |
| $a$ | Attention vector | – |
| $\tilde{X}$ | Scaled activation map | – |
| $H$ | Output of the BiLSTM layer | – |
| $z$ | Output of the fully connected layer | – |
| $\hat{y}$ | Output of TBiNet | – |
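For concreteness, the one-hot input $S$ from the table can be built directly from a raw DNA string. A minimal NumPy sketch, assuming the column order 'A', 'G', 'C', 'T' used in the Figure 1 caption (the helper name `one_hot_encode` is ours):

```python
import numpy as np

# Column order follows the Figure 1 caption: 'A', 'G', 'C', 'T'.
NUCLEOTIDES = {"A": 0, "G": 1, "C": 2, "T": 3}

def one_hot_encode(seq: str, length: int = 1000) -> np.ndarray:
    """Encode a DNA string as a (length x 4) one-hot matrix S.

    Unknown bases (e.g. 'N') are left as all-zero rows; sequences are
    truncated or zero-padded to the fixed model input length.
    """
    S = np.zeros((length, 4), dtype=np.float32)
    for i, base in enumerate(seq[:length].upper()):
        col = NUCLEOTIDES.get(base)
        if col is not None:
            S[i, col] = 1.0
    return S

S = one_hot_encode("ACGT" * 250)   # a toy 1,000-bp sequence
print(S.shape)                     # (1000, 4)
```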
Figure 1 An illustration of TBiNet. An input DNA sequence is represented as a one-hot encoded matrix $S$ in which each row corresponds to a sequence position and each column corresponds to a DNA nucleotide ('A', 'G', 'C', or 'T'). Kernels in the CNN layer act as motif scanners. The matrix $X$ is the output of the convolution layer. $k$ is a randomly initialized vector used for attention (the key vector). In the attention layer, matrix–vector multiplication is applied between $X$ and $k$. The output is then normalized by the softmax function, yielding the attention vector $a$. Element-wise multiplication is applied between each column of $X$ and $a$, and the resulting matrix $\tilde{X}$ is passed to the BiLSTM layer. The output of the BiLSTM layer is flattened and fed into a fully connected layer. A sigmoid function is applied in the output layer, and the final output of the model, $\hat{y}$, is a 690-dimensional vector in which each dimension denotes a TF-cell line combination used for a ChIP-seq experiment.
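The attention step described above reduces to a matrix–vector product, a softmax, and a column-wise rescaling. A minimal NumPy sketch of that forward pass, using the symbols reconstructed in the notation table (the shapes t = 75 and d = 320 follow the maxpool output size and kernel count; in the trained model the key vector is learned, here it is simply randomly initialized):

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 75, 320                        # maxpool output size, number of kernels

X = rng.standard_normal((t, d))       # activation map from CNN + maxpool
k = rng.standard_normal(d)            # key vector

scores = X @ k                        # matrix-vector multiplication: shape (t,)
a = np.exp(scores - scores.max())     # numerically stable softmax...
a /= a.sum()                          # ...yields the attention vector a: shape (t,)

X_tilde = X * a[:, None]              # multiply each column of X element-wise by a

print(a.shape, X_tilde.shape)         # (75,) (75, 320)
```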
Comparison of neural network architectures between TBiNet and baseline methods.
| Model | CNN | RNN | Attention |
|---|---|---|---|
| DeepSea | O | – | – |
| DanQ | O | O | – |
| TBiNet | O | O | O |
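Putting the three components together, a TBiNet-style network can be sketched in Keras as below. Layer sizes follow the notation table; the `KeyAttention` layer is our re-implementation of the mechanism in Figure 1, and the hidden width of the fully connected layer is an assumption, not the authors' value:

```python
import tensorflow as tf
from tensorflow.keras import layers

class KeyAttention(layers.Layer):
    """Softmax attention over sequence positions using a learned key vector."""
    def build(self, input_shape):
        self.k = self.add_weight(shape=(input_shape[-1],),
                                 initializer="random_normal", name="key")

    def call(self, x):                        # x: (batch, t, d)
        scores = tf.linalg.matvec(x, self.k)  # (batch, t)
        a = tf.nn.softmax(scores, axis=-1)    # attention over positions
        return x * a[..., None]               # weight each position of x

def build_tbinet_like(num_tasks: int = 690) -> tf.keras.Model:
    inp = layers.Input(shape=(1000, 4))                   # one-hot DNA sequence
    x = layers.Conv1D(320, 26, activation="relu")(inp)    # motif-scanning kernels
    x = layers.MaxPooling1D(pool_size=13, strides=13)(x)  # -> (75, 320)
    x = KeyAttention()(x)
    x = layers.Bidirectional(layers.LSTM(320, return_sequences=True))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)           # hidden width: our assumption
    out = layers.Dense(num_tasks, activation="sigmoid")(x)
    return tf.keras.Model(inp, out)

model = build_tbinet_like()
model.summary()
```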
Quantitative evaluation of DeepSea, DanQ, and TBiNet: average AUROC and AUPR scores obtained in the 690 ChIP-seq experiments.
| Model | AUROC | AUPR |
|---|---|---|
| DeepSea | 0.9015 | 0.2485 |
| DanQ | 0.9316 | 0.2959 |
| TBiNet | 0.9473 | 0.3332 |
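The scores above are averages over the 690 ChIP-seq prediction tasks. A sketch of that evaluation with scikit-learn, assuming `y_true` and `y_prob` are (samples × 690) arrays of binary labels and predicted probabilities (both names are ours; `average_precision_score` is used here as the AUPR estimate):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray):
    """Average AUROC/AUPR over the 690 TF-cell line tasks (columns)."""
    aurocs, auprs = [], []
    for task in range(y_true.shape[1]):
        labels, probs = y_true[:, task], y_prob[:, task]
        if labels.min() == labels.max():       # skip tasks with a single class
            continue
        aurocs.append(roc_auc_score(labels, probs))
        auprs.append(average_precision_score(labels, probs))
    return np.mean(aurocs), np.mean(auprs)

# toy example: 1,000 samples x 690 tasks of random labels/scores
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 690))
y_prob = rng.random(size=(1000, 690))
print(evaluate(y_true, y_prob))
```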
Figure 2 Scatter plots of the AUROC and AUPR scores of DanQ and TBiNet. (a) AUROC scores for the 690 tasks, with the scores of DanQ on the x-axis and the scores of TBiNet on the y-axis. (b) AUPR scores for the 690 tasks, laid out the same way. Blue and red points indicate tasks on which TBiNet scored higher or lower than DanQ, respectively. The red-circled points in (b) are selected for further analysis.
Figure 3 Visualization of CNN kernels resulting in exceptionally high AUPR scores. (a) A motif of NRSF captured by one CNN kernel from DanQ. (b–f) Five CNN kernels from TBiNet that capture the same NRSF motif; (b) and (c) represent the reverse complement of the motif sequence. In each panel, the top and bottom logos show the actual and predicted NRSF motifs, respectively.
Figure 4 Examples of TF binding motifs extracted by TBiNet. In each pair, the bottom sequence logo is obtained from TBiNet and the top sequence logo is the most similar motif from the reference motif database. The name of the transcription factor and the matching significance value (E-value) are shown above each pair.
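Logos like those in Figures 3 and 4 are conventionally derived by scanning each learned kernel across the test sequences, collecting the subsequences that activate it most strongly, and stacking them into a position frequency matrix (PFM), which is then matched against a reference motif database (the E-values in Figure 4 point to a TOMTOM-style comparison). A sketch of the PFM step under those assumptions; the activation-fraction threshold `frac` is ours:

```python
import numpy as np

def kernel_to_pfm(kernel: np.ndarray, seqs: np.ndarray, frac: float = 0.5) -> np.ndarray:
    """Build a position frequency matrix for one convolutional kernel.

    kernel: (l, 4) filter weights; seqs: (N, n, 4) one-hot test sequences.
    Subsequences whose activation exceeds `frac` of the per-sequence maximum
    are aligned and counted (the threshold choice is an assumption).
    """
    l = kernel.shape[0]
    hits = []
    for s in seqs:
        # valid convolution: activation at every window of the sequence
        acts = np.array([(s[i:i + l] * kernel).sum() for i in range(len(s) - l + 1)])
        for i in np.where(acts > frac * acts.max())[0]:
            hits.append(s[i:i + l])
    if not hits:
        raise ValueError("kernel never activated above the threshold")
    counts = np.sum(hits, axis=0) + 1e-6         # (l, 4) base counts + pseudocount
    return counts / counts.sum(axis=1, keepdims=True)

# toy usage: one random 26-bp kernel against 10 random 1,000-bp sequences
rng = np.random.default_rng(0)
kernel = rng.standard_normal((26, 4))
seqs = np.eye(4)[rng.integers(0, 4, size=(10, 1000))].astype(np.float32)
print(kernel_to_pfm(kernel, seqs).shape)         # (26, 4)
```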
Figure 5 Dendrograms of motifs clustered by edit distance. The x-axis indicates the indices of the motifs discovered by each model, and the y-axis indicates the edit distance. The red dotted lines are the thresholds used to remove redundant motifs.
The number of distinct motifs filtered by edit distance.
| Model | Unfiltered | d = 1 | d = 2 | d = 3 | # of kernels |
|---|---|---|---|---|---|
| DanQ | 66 | 61 | 58 | 40 | 320 |
| TBiNet | 142 | 142 | 136 | 91 | 320 |
d indicates the edit distance between motif sequences.
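The de-duplication behind Figure 5 and the table above can be approximated by computing pairwise edit distances between motif consensus strings and cutting a hierarchical clustering at the threshold d. A sketch with SciPy (the toy consensus strings and the complete-linkage choice are our assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distinct_motifs(motifs: list[str], d: int) -> int:
    """Number of motif clusters left after merging motifs within edit distance d."""
    n = len(motifs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = edit_distance(motifs[i], motifs[j])
    Z = linkage(squareform(dist), method="complete")
    return len(set(fcluster(Z, t=d, criterion="distance")))

motifs = ["TTCAGCACC", "TTCAGCACG", "GGGACTACA", "GGGACTACA"]  # toy consensus strings
for d in (1, 2, 3):
    print(d, distinct_motifs(motifs, d))
```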
Figure 6 Visualization of attention scores. The top heatmap shows the attention scores averaged over the DNA sequences to which no TF binds; the bottom heatmap shows the scores averaged over the DNA sequences to which at least one TF binds.
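The heatmaps in Figure 6 average the per-position attention vectors over those two groups of test sequences. A Matplotlib sketch of that aggregation, assuming `attn` holds one attention vector per sequence and `labels` the corresponding binary TF-binding matrix (both arrays below are random stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
attn = rng.random((1000, 75))                          # one attention vector per sequence
labels = (rng.random((1000, 690)) < 0.002).astype(int) # sparse labels keep both groups non-empty

bound = labels.any(axis=1)                             # at least one TF binds
groups = {
    "no TF binds": attn[~bound].mean(axis=0),
    "at least one TF binds": attn[bound].mean(axis=0),
}

fig, axes = plt.subplots(2, 1, figsize=(8, 2.5))
for ax, (name, avg) in zip(axes, groups.items()):
    ax.imshow(avg[None, :], aspect="auto", cmap="viridis")  # 1 x 75 heatmap row
    ax.set_yticks([])
    ax.set_title(name, fontsize=8)
axes[-1].set_xlabel("position (after maxpool)")
plt.tight_layout()
plt.show()
```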