| Literature DB >> 33986992 |
Guobin Li¹, Xiuquan Du², Xinlu Li¹, Le Zou¹, Guanhong Zhang¹, Zhize Wu¹.
Abstract
DNA-binding proteins (DBPs) play pivotal roles in many biological functions such as alternative splicing, RNA editing, and methylation. Many traditional machine learning (ML) and deep learning (DL) methods have been proposed to predict DBPs. However, these methods either rely on manual feature extraction or fail to capture long-term dependencies in the DNA sequence. In this paper, we propose a method, called PDBP-Fusion, to identify DBPs based on the fusion of local features and long-term dependencies using only primary sequences. We utilize a convolutional neural network (CNN) to learn local features and a bi-directional long short-term memory network (Bi-LSTM) to capture critical long-term dependencies in context. In addition, we perform feature extraction, model training, and model prediction simultaneously. The PDBP-Fusion approach predicts DBPs with 86.45% sensitivity, 79.13% specificity, 82.81% accuracy, and 0.661 MCC on the PDB14189 benchmark dataset. The MCC of our proposed method is at least 9.1% higher than that of other advanced prediction models. Moreover, PDBP-Fusion also achieves superior performance and model robustness on the PDB2272 independent dataset. This demonstrates that PDBP-Fusion can be used to predict DBPs from sequences accurately and effectively; the online server is at http://119.45.144.26:8080/PDBP-Fusion/. ©2021 Li et al.
Keywords: Convolution neural network (CNN); DNA binding protein prediction; Deep learning; Fusion approach; Long short-term memory network (LSTM); Long-term dependence
Year: 2021 PMID: 33986992 PMCID: PMC8101451 DOI: 10.7717/peerj.11262
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1 Architecture of the proposed PDBP-Fusion model.
Figure 2 Statistical graph of DNA sequence length distribution in the PDB14189 dataset.
Figure 3 Coding diagram of (A) One-hot encoding and (B) word embedding encoding.
Amino acid encoder.
| Symbol | Code |
| A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y | A:1, C:2, D:3, E:4, F:5, G:6, H:7, I:8, K:9, L:10, M:11, N:12, P:13, Q:14, R:15, S:16, T:17, V:18, W:19, Y:20 |
| B, J, O, U, X, Z | 0 |
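A minimal sketch of this integer coding (my own illustration, not code from the paper): the 20 standard residues map to 1–20 as in the table, and the six remaining letters share code 0. Zero-padding to a fixed length is an assumption added here for fixed-size model input.

```python
# Illustrative sketch of the amino-acid integer coding in the table above;
# the pad/truncate-to-max_len step is an assumption for fixed-size input.
AA_CODE = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}  # A:1 ... Y:20

def encode_sequence(seq: str, max_len: int) -> list[int]:
    """Map residues to integers; B/J/O/U/X/Z (and padding) become 0."""
    codes = [AA_CODE.get(aa, 0) for aa in seq.upper()[:max_len]]
    return codes + [0] * (max_len - len(codes))

print(encode_sequence("MKV", 6))  # -> [11, 9, 18, 0, 0, 0]
```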
Figure 4 CNN network structure.
Figure 5 Model evaluation on the benchmark dataset PDB14189.
Parameter details of the proposed models.
| Layer | PDBP-CNN | Output shape | PDBP-Fusion | Output shape |
| 1 | One-hot encoding | Len*20 | One-hot encoding | Len*20 |
| 2 | Convolution1 (kernel=7, stride=1) | Len*64 | Convolution1 (kernel=9, stride=1) | Len*64 |
| 3 | Max-pooling1 (kernel=2) | (Len/2)*64 | Max-pooling1 (kernel=2) | (Len/2)*64 |
| 4 | Convolution2 (kernel=7, stride=1) | (Len/2)*64 | Convolution2 (kernel=9, stride=1) | (Len/2)*64 |
| 5 | Max-pooling2 (kernel=2) | (Len/4)*64 | Max-pooling2 (kernel=2) | (Len/4)*64 |
| 6 | Convolution3 (kernel=7, stride=1) | (Len/4)*64 | Bi-LSTM(32) | (Len/4)*64 |
| 7 | Max-pooling3 (kernel=2) | (Len/8)*64 | Dense(128) | 128 |
| 8 | Dense(128) | 128 | Dense(2) | 2 |
| 9 | Dense(2) | 2 | – | – |
Notes.
“Len” denotes the maximum input sequence length.
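As an illustration of the PDBP-Fusion column of this table, a minimal Keras sketch (my own reconstruction under assumptions, not the authors' released code): kernel size 9, 64 filters, two conv/pool stages, a Bi-LSTM, and two dense layers. The maximum length and dropout rate are taken from the notes to the peak-performance table below; the record is inconsistent on the Bi-LSTM width (the table says 32, the notes say 16*2), so the unit count here is an assumption following the table.

```python
# Minimal sketch of the PDBP-Fusion layout (reconstruction, not the authors'
# code). Input: one-hot encoded sequences of shape (MAX_LEN, 20).
from tensorflow.keras import layers, models

MAX_LEN = 700  # per the notes on the peak PDBP-Fusion model below

inputs = layers.Input(shape=(MAX_LEN, 20))                  # one-hot residues
x = layers.Conv1D(64, 9, padding="same", activation="relu")(inputs)
x = layers.MaxPooling1D(2)(x)                               # -> (MAX_LEN/2, 64)
x = layers.Conv1D(64, 9, padding="same", activation="relu")(x)
x = layers.MaxPooling1D(2)(x)                               # -> (MAX_LEN/4, 64)
x = layers.Bidirectional(layers.LSTM(32))(x)                # final states, 64 features
x = layers.Dropout(0.3)(x)                                  # dropout rate from the notes
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(2, activation="softmax")(x)          # DBP vs. non-DBP

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```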
Quantitative results of the PDBP-CNN method with different maximum sequence lengths.
| Max length | ACC (%) | PRE (%) | SN (%) | SP (%) | MCC (%) | AUC (%) |
| 100 | 76.51 | 73.83 | 83.53 | 69.43 | 54.05 | 85.26 |
| 150 | 78.85 | 76.92 | 83.29 | 74.37 | 58.27 | 87.21 |
| 200 | 79.48 | 77.9 | 83.28 | 75.64 | 59.55 | 88.02 |
| 250 | 80.38 | 78.61 | 84.17 | 76.56 | 61.21 | 88.71 |
| 300 | 80.72 | 79.42 | 83.70 | 77.71 | 61.90 | 89.06 |
| 350 | 81.04 | 77.95 | 87.32 | 74.71 | 62.86 | 89.42 |
| 400 | 81.28 | 78.34 | 87.39 | 75.11 | 63.40 | 89.7 |
| 450 | 81.52 | 79.21 | 86.20 | 76.80 | 63.59 | 89.94 |
| 500 | 81.55 | 78.72 | 87.36 | 75.68 | 63.89 | 90.00 |
| 550 | 81.32 | 78.52 | 87.21 | 75.38 | 63.46 | 89.86 |
| 600 | 81.88 | 80.64 | 84.67 | 79.05 | 64.21 | 90.18 |
| 650 | 81.74 | 80.58 | 84.49 | 78.97 | 63.97 | 90.12 |
| 700 | 81.79 | 78.22 | 88.83 | 74.68 | 64.42 | 90.13 |
| 750 | 81.68 | 79.04 | 87.07 | 76.24 | 64.08 | 89.92 |
| 800 | 81.71 | 78.33 | 88.44 | 74.92 | 64.28 | 90.11 |
| 850 | 82.02 | 79.25 | 87.49 | 76.50 | 64.69 | 90.29 |
| 900 | 81.94 | 79.09 | 87.51 | 76.32 | 64.53 | 90.13 |
| 950 | 81.91 | 79.50 | 86.74 | 77.03 | 64.44 | 90.24 |
| 1000 | 82.04 | 80.33 | 85.40 | 78.65 | 64.43 | 90.05 |
| >1000 | – | – | – | – | – | – |
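For reference, the metric columns in this and the next table follow the standard confusion-matrix definitions (values shown ×100; AUC, which requires ranked scores, is omitted). A minimal helper, as an illustration:

```python
# Standard confusion-matrix metrics as reported in these tables (illustration
# only). AUC needs per-sample scores and is not computable from counts alone.
import math

def metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    acc = (tp + tn) / (tp + tn + fp + fn)           # accuracy
    pre = tp / (tp + fp)                            # precision
    sn = tp / (tp + fn)                             # sensitivity (recall)
    sp = tn / (tn + fp)                             # specificity
    d = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / d if d else 0.0     # Matthews correlation coefficient
    return {"ACC": acc, "PRE": pre, "SN": sn, "SP": sp, "MCC": mcc}
```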
Quantitative results of the PDBP-Fusion method with different maximum sequence lengths.
| Max length | ACC (%) | PRE (%) | SN (%) | SP (%) | MCC (%) | AUC (%) |
| 100 | 77.28 | 74.53 | 83.99 | 70.50 | 55.42 | 85.79 |
| 150 | 78.91 | 76.62 | 84.24 | 73.53 | 58.57 | 87.36 |
| 200 | 80.08 | 77.88 | 84.77 | 75.35 | 60.71 | 88.33 |
| 250 | 80.74 | 79.51 | 83.56 | 77.89 | 61.9 | 88.96 |
| 300 | 81.44 | 79.57 | 85.25 | 77.60 | 63.33 | 89.50 |
| 350 | 82.27 | 80.39 | 85.8 | 78.71 | 64.83 | 90.01 |
| 400 | 81.70 | 79.48 | 86.38 | 76.98 | 64.12 | 90.11 |
| 450 | 82.30 | 79.36 | – | 76.71 | 65.15 | 90.37 |
| 500 | 82.50 | 79.58 | 87.95 | 77.01 | 65.54 | 90.43 |
| 550 | 82.16 | 79.97 | 86.56 | 77.72 | 64.90 | 90.37 |
| 600 | 82.56 | 80.87 | 85.90 | – | 65.50 | 90.61 |
| 650 | 82.81 | 80.84 | 86.56 | 79.03 | 66.02 | 90.7 |
| 700 | 82.81 | – | 86.45 | 79.13 | 66.10 | 90.83 |
| 750 | 82.66 | 80.51 | 86.82 | 78.45 | 65.8 | 90.65 |
| 800 | 82.7 | 79.99 | 87.8 | 77.56 | 65.95 | 90.74 |
| 850 | 82.71 | 80.51 | 86.93 | 78.44 | 65.89 | 90.73 |
| 900 | 82.63 | 80.06 | 87.61 | 77.61 | 65.85 | 90.69 |
| >900 | – | – | – | – | – | – |
PDBP-Fusion model performance using word-embedding encoding on the PDB14189 dataset.
| Model | ACC (%) | SN (%) | SP (%) | MCC (%) | AUC (%) |
| PDBP-Fusion (64 kernels) | 81.01 | 78.48 | 81.58 | 62.0 | 89.03 |
| PDBP-Fusion (32 kernels) | 79.40 | 83.60 | 75.15 | 59.1 | 87.81 |
Notes.
PDBP-Fusion model: (length = 800, word embedding encoding, 64 convolution kernels).
PDBP-Fusion model: (length = 800, word embedding encoding, 32 convolution kernels).
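The word-embedding variant in the table above swaps the one-hot input for a learned embedding over the 21 integer codes. A minimal sketch, assuming the 64-kernel configuration from the first note; the embedding dimension is my assumption, as it is not stated in this record:

```python
# Sketch of the word-embedding input variant (embedding size is an assumption).
from tensorflow.keras import layers, models

MAX_LEN = 800  # per the notes above
inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")      # integer-coded residues
x = layers.Embedding(input_dim=21, output_dim=32)(inputs)   # 21 codes: 0 + A..Y
x = layers.Conv1D(64, 9, padding="same", activation="relu")(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(64, 9, padding="same", activation="relu")(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Bidirectional(layers.LSTM(32))(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(2, activation="softmax")(x)
model = models.Model(inputs, outputs)
```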
Figure 6 MCCs and AUCs of the top three proposed models.
Figure 7 MCCs and AUCs of models with different dropout rates (violin plot).
Figure 8 MCCs and AUCs of models with different dropout rates (box plot).
Peak performance of the PDBP-CNN and PDBP-Fusion models on the PDB14189 dataset.
| Model | ACC (%) | SN (%) | SP (%) | MCC (%) | AUC (%) |
| PDBP-CNN | 82.02 ± 1.22 | 87.49 ± 4.12 | 76.50 ± 5.66 | 64.69 ± 1.87 | 90.29 ± 0.51 |
| PDBP-Fusion | 82.81 ± 1.30 | 86.45 ± 4.59 | 79.13 ± 5.81 | 66.1 ± 2.04 | 90.83 ± 0.57 |
Notes.
PDBP-CNN model: the maximum length is 850; there are three convolutional layers with kernel size (7*1); the max-pooling size is (2,1); the fully connected layer has 128 nodes; the dropout rate is set to 0.2.
PDBP-Fusion model: the maximum length is 700; there are two convolutional layers with kernel size (9*1); the max-pooling size is (2,1); the Bi-LSTM has 16*2 cells; the fully connected layer has 128 nodes; the dropout rate is set to 0.3.
Figure 9 Overview of the StackDPPred prediction framework based on One-hot encoding.
Comparison of the proposed models with other methods on the PDB14189 dataset.
| Method | ACC (%) | SN (%) | SP (%) | MCC (%) | AUC (%) |
| MsDBP | 80.29 | 80.87 | 79.72 | 60.61 | 88.31 |
| PSSM | 79.62 | 76.02 | 83.21 | 59.4 | – |
| PSSM-PP | 81.69 | 78.92 | 84.45 | 63.5 | – |
| PHY | 77.65 | 73.54 | 81.76 | 55.5 | – |
| PSSM-PP+BP_NBP | 81.01 | – | – | – | – |
| PSSM-PP+PHY | 82.67 | 79.95 | 85.39 | 65.4 | – |
| BP_NBP+PHY | 80.40 | 76.88 | 83.92 | 60.9 | – |
| ALL features | 82.23 | – | – | – | – |
| 64 Optimal features | 83.76 | – | – | – | – |
| StackDPPred(One-hot) | 76.00 | 79.27 | 72.71 | 52.10 | 83.18 |
| PDBP-CNN | 82.02 | 87.49 | 76.50 | 64.69 | 90.29 |
| PDBP-Fusion | 82.81 | 86.45 | 79.13 | 66.1 | 90.83 |
Notes.
DNABP: method using an RF classifier and various features (Ma, Guo & Sun, 2016).
StackDPPred(One-hot): the StackDPPred method with One-hot encoding (Mishra, Pokhrel & Hoque, 2019).
Comparison of various machine learning methods on the PDB2272 dataset.
| Method | ACC (%) | SN (%) | SP (%) | MCC (%) | AUC (%) |
| Qu et al. | 48.33 | 49.07 | 48.31 | −3.34 | 47.76 |
| Local-DPP | 50.57 | 58.72 | 8.76 | 4.56 | – |
| PseDNA-Pro | 61.88 | 59.90 | 75.28 | 24.30 | – |
| DPP-PseAAC | 58.10 | 59.10 | 56.63 | 16.25 | 61.00 |
| MsDBP | 66.99 | 66.42 | 70.69 | 33.97 | 73.83 |
| PDBP-Fusion | 66.85 | – | – | – | – |
Figure 10 Index page of the web server.