Feifei Cui, Shuang Li, Zilong Zhang, Miaomiao Sui, Chen Cao, Abd El-Latif Hesham, Quan Zou.
Abstract
Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play vital roles in gene expression, so accurate identification of these proteins is crucial. However, two challenges remain: existing predictors ignore proteins that bind both DNA and RNA (DRBPs), and they suffer from cross-prediction, in which DBP predictors label RBPs as DBPs, and vice versa. In this study, we propose a computational predictor, DeepMC-iNABP, that addresses these difficulties with a multiclass classification strategy and deep learning. DBPs, RBPs, DRBPs and non-NABPs were used as separate classes to train the DeepMC-iNABP model. Results on the test data collected in this study and on two independent test datasets show that DeepMC-iNABP has a strong advantage in identifying DRBPs and alleviates the cross-prediction problem to a certain extent. The web server of DeepMC-iNABP is freely available at http://www.deepmc-inabp.net/. The datasets used in this research can also be downloaded from the website.
Keywords: DNA-binding protein; Deep learning; Multiclass classification; Nucleic acid-binding protein; RNA-binding protein
Year: 2022 PMID: 35521556 PMCID: PMC9065708 DOI: 10.1016/j.csbj.2022.04.029
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Data collected in this study for training, validation and testing.
| Classes | Train data | Validation data | Test data | In total |
|---|---|---|---|---|
| DBP chains | 9,720 | 1,080 | 1,200 | 12,000 |
| RBP chains | 9,720 | 1,080 | 1,200 | 12,000 |
| DRBP chains | 972 | 108 | 120 | 1,200 |
| Non-NABP chains | 9,720 | 1,080 | 1,200 | 12,000 |
| In total | 30,132 | 3,348 | 3,720 | 37,200 |
Abbreviations: DBP, DNA-binding protein; RBP, RNA-binding protein; DRBP, DNA- and RNA-binding protein; non-NABP, non-nucleic acid-binding protein.
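The split sizes in the table above can be checked with a few lines of arithmetic. The sketch below copies the counts from the table and computes the per-class split fractions and column totals. Reading the resulting 81% / 9% / 10% pattern as "hold out 10% for testing, then 10% of the remainder for validation" is an inference from the numbers, not a procedure stated here. The names `counts` and `split_fractions` are illustrative.

```python
# Counts copied from the table above: (train, validation, test) per class.
counts = {
    "DBP": (9720, 1080, 1200),
    "RBP": (9720, 1080, 1200),
    "DRBP": (972, 108, 120),
    "non-NABP": (9720, 1080, 1200),
}


def split_fractions(train: int, val: int, test: int) -> tuple:
    """Return the share of a class assigned to train, validation and test."""
    total = train + val + test
    return (train / total, val / total, test / total)


# Per-class fractions, rounded to two decimals for display.
fractions = {
    name: tuple(round(f, 2) for f in split_fractions(*c))
    for name, c in counts.items()
}

# Column totals, which should match the "In total" row of the table.
column_totals = tuple(sum(c[i] for c in counts.values()) for i in range(3))

for name, frac in fractions.items():
    print(name, frac)
print("totals:", column_totals, "overall:", sum(column_totals))
```

Every class follows the same proportions, so the DRBP class stays at one tenth the size of the other classes in each split.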
Independent test datasets.
| Test datasets | DBPs | RBPs | DRBPs | non-NABP | In total |
|---|---|---|---|---|---|
| TEST474 | 175 | 68 | 8 | 223 | 474 |
| DRBP206 | 103 | 0 | 0 | 103 | 206 |
Fig. 1. Fundamental architecture of the DeepMC-iNABP model.
Performance evaluation of test data collected in this study.
| Class | Precision | Recall | F1-score |
|---|---|---|---|
| DBP class | 0.889 | 0.707 | 0.787 |
| RBP class | 0.804 | 0.819 | 0.812 |
| DRBP class | 0.926 | 0.625 | 0.746 |
| Non-NABP class | 0.700 | 0.853 | 0.769 |
| Average value | 0.835 | 0.725 | 0.759 |
Abbreviations: DBP, DNA-binding protein; RBP, RNA-binding protein; DRBP, DNA- and RNA-binding protein; non-NABP, non-nucleic acid-binding protein.
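The F1-score column in the performance table is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes F1 from the per-class precision and recall values in that table. Because the inputs are already rounded to three decimals, the recomputed values can differ from the reported ones by about 0.001. The helper name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the performance table above.
reported = {
    "DBP": (0.889, 0.707),
    "RBP": (0.804, 0.819),
    "DRBP": (0.926, 0.625),
    "Non-NABP": (0.700, 0.853),
}

# Recompute F1 per class from the rounded table entries.
f1_by_class = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_by_class.items():
    print(f"{name}: F1 = {f1:.3f}")
```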
Fig. 2. Performance of the DeepMC-iNABP model on the test dataset and independent test datasets. A and B. Confusion matrix and ROC-AUC curves of our model on the test dataset collected in this study. C and D. ROC-AUC curves of our model on the independent test datasets (TEST474 and DRBP206). Classes 0–3 in the ROC-AUC curves refer to non-NABPs, DBPs, RBPs and DRBPs, respectively.
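Per-class ROC-AUC values for a four-class model are commonly computed one-vs-rest: each class in turn is treated as positive and the remaining three as negative. The sketch below illustrates that idea in plain Python, using the class indices from the caption (0 = non-NABP, 1 = DBP, 2 = RBP, 3 = DRBP). It is an assumption that the curves were produced this way; the scores are made-up toy values, and `one_vs_rest_auc` is an illustrative helper, not part of DeepMC-iNABP.

```python
from typing import List, Sequence


def one_vs_rest_auc(y_true: Sequence[int],
                    scores: Sequence[Sequence[float]],
                    positive_class: int) -> float:
    """AUC for one class against all others.

    Uses the rank interpretation of AUC: the probability that a random
    positive sample scores higher than a random negative one (ties count 0.5).
    """
    pos = [s[positive_class] for y, s in zip(y_true, scores) if y == positive_class]
    neg = [s[positive_class] for y, s in zip(y_true, scores) if y != positive_class]
    if not pos or not neg:
        # AUC is undefined when a class is absent or is the only class present.
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true class index and predicted class probabilities.
y_true: List[int] = [0, 1, 2, 3, 1, 0]
scores: List[List[float]] = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.30, 0.50, 0.10],
    [0.10, 0.20, 0.20, 0.50],
    [0.40, 0.25, 0.30, 0.05],
    [0.50, 0.30, 0.10, 0.10],
]

auc_per_class = [one_vs_rest_auc(y_true, scores, k) for k in range(4)]
print(auc_per_class)
```

The `ValueError` branch matters in practice: a class with no samples in a dataset has no defined one-vs-rest AUC, which is relevant when an independent test set lists zero counts for some classes.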
Comparison of DeepMC-iNABP and existing models on the independent dataset TEST474.
| Model | DBP recall | DBP precision | DBP F1 | RBP recall | RBP precision | RBP F1 | DRBP recall | DRBP precision | DRBP F1 | Non-NABP recall | Non-NABP precision | Non-NABP F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepDRBP-2L | 0.817 | 0.877 | 0.846 | 0.456 | 0.620 | 0.525 | 0.125 | 0.043 | 0.065 | 0.888 | 0.832 | 0.859 |
| iDRBP_MMC | 0.869 | 0.950 | 0.907 | 0.706 | 0.727 | 0.716 | 0.125 | 0.333 | 0.182 | 0.933 | 0.849 | 0.889 |
| DeepMC-iNABP | 0.834 | 0.869 | 0.851 | 0.588 | 0.714 | 0.645 | | | | 0.892 | 0.812 | 0.850 |
The results were obtained using the webserver of DeepDRBP-2L [34].
The results were obtained using the webserver of iDRBP_MMC [32].
Fig. 3. Confusion matrix of DeepMC-iNABP and existing models on the independent dataset TEST474.
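A confusion matrix contains everything needed to derive the per-class recall, precision and F1 values used in the comparison, as well as the DBP/RBP cross-prediction rates. The sketch below shows the calculation on a small made-up matrix; the counts are hypothetical and are not taken from the figure. The class order and the helper names `per_class_metrics` and `cross_prediction_rates` are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Assumed class order for rows (true class) and columns (predicted class).
CLASSES = ["non-NABP", "DBP", "RBP", "DRBP"]


def per_class_metrics(cm: List[List[int]]) -> Dict[str, Dict[str, float]]:
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # row sum: true members
        predicted = sum(cm[i][k] for i in range(n))  # column sum: predicted members
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[CLASSES[k]] = {"recall": recall, "precision": precision, "f1": f1}
    return metrics


def cross_prediction_rates(cm: List[List[int]]) -> Tuple[float, float]:
    """Fraction of true DBPs predicted as RBPs, and of true RBPs predicted as DBPs."""
    dbp, rbp = CLASSES.index("DBP"), CLASSES.index("RBP")
    return cm[dbp][rbp] / sum(cm[dbp]), cm[rbp][dbp] / sum(cm[rbp])


# Hypothetical counts, for illustration only.
toy_cm = [
    [8, 1, 1, 0],
    [2, 6, 1, 1],
    [1, 2, 7, 0],
    [0, 1, 1, 2],
]

toy_metrics = per_class_metrics(toy_cm)
toy_cross = cross_prediction_rates(toy_cm)
print(toy_metrics["DBP"], toy_cross)
```

Reading the off-diagonal DBP/RBP cells directly, as `cross_prediction_rates` does, is one simple way to quantify the cross-prediction problem described in the abstract.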
Fig. 4. Comparison of DeepMC-iNABP and existing models on the independent datasets. A and C. Independent dataset TEST474. B. Independent dataset DRBP206.
Fig. 5. Feature visualization of DeepMC-iNABP by t-SNE for dimension reduction. A. Feature representation of the test dataset collected in this study. B. Feature representation of independent dataset TEST474. C. Feature representation of independent dataset DRBP206.