Zhan-Heng Chen1,2, Zhu-Hong You1,2, Wen-Bo Zhang1,2, Yan-Bin Wang1, Li Cheng1,2, Daniyal Alghazzawi3.
Abstract
Self-interacting proteins (SIPs) are of paramount importance in current molecular biology. A number of traditional biological experimental methods for identifying SIPs have been developed in the past few years. However, these methods are costly, time-consuming, and inefficient, which often limits their use for predicting SIPs. Computational methods have therefore emerged to meet this need. In this paper, we propose, for the first time, a deep learning model combined with a natural language processing (NLP) method for predicting potential SIPs from protein sequence information. More specifically, each protein sequence is de novo assembled by k-mers. We then obtain a global vector representation of each protein sequence using the NLP technique. Finally, based on known self-interacting and non-interacting proteins, a multi-grained cascade forest model is trained to predict SIPs. Comprehensive experiments on yeast and human datasets yielded accuracies of 91.45% and 93.12%, respectively. The experimental results show that using amino acid semantic information is very helpful for addressing the problem of sequences containing both self-interacting and non-interacting pairs of proteins. This work may have potential applications in various biological classification problems.
Keywords: de novo protein sequence; global vector representation; multi-grained cascade forest; self-interacting proteins
Year: 2019 PMID: 31726752 PMCID: PMC6896115 DOI: 10.3390/genes10110924
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. De novo assembled protein sequences by 3-mer.
Figure 2. Process of multi-grained scanning.
Figure 3. Cascade forest model.
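As a rough illustration of the 3-mer step named in Figure 1, the sketch below splits a protein sequence into k-mer "words" that an NLP embedding method could then consume. The record does not say whether the k-mers overlap, so the `stride` parameter, the function name `split_into_kmers`, and the toy sequence are all assumptions made for illustration, not the authors' implementation.

```python
def split_into_kmers(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a protein sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Which variant the paper uses is not stated here, so both are supported.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    # The last valid start index is len(sequence) - k; shorter inputs yield [].
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


if __name__ == "__main__":
    toy_sequence = "MKVLAA"  # hypothetical amino acid sequence
    print(split_into_kmers(toy_sequence))            # overlapping 3-mers
    print(split_into_kmers(toy_sequence, stride=3))  # non-overlapping 3-mers
```

Treating each k-mer as a token is what lets a word-embedding technique be applied to sequences; the choice of stride changes how many tokens each protein produces.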
Confusion matrix. TN: true negative, FN: false negative, FP: false positive, TP: true positive.
| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | TN | FP |
| Actual positive | FN | TP |
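The four confusion-matrix counts are enough to compute the metrics reported in the tables that follow. This record does not spell out the formulas, so the sketch below assumes the standard definitions of accuracy, true negative rate, F1-score, and Matthews correlation coefficient; the function name `classification_metrics` and the example counts are hypothetical.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute Acc, TNR, F1, and MCC from confusion-matrix counts.

    Uses the standard textbook definitions (an assumption; the source
    does not list its formulas). Undefined ratios are returned as 0.0.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    acc = (tp + tn) / total
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {"acc": acc, "tnr": tnr, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical counts for an imbalanced test set (not from the paper).
    print(classification_metrics(tp=30, tn=900, fp=10, fn=60))
```

With imbalanced counts like these, accuracy and TNR can be high while F1 stays modest, which is why the tables report all four metrics together.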
Performance of our proposed model on the two benchmark datasets. Acc: accuracy; TNR: true negative rate; F1-score: harmonic mean of precision and recall, measuring the overall performance of the classification model; MCC: Matthews correlation coefficient.
| Dataset | Acc (%) | TNR (%) | F1-score (%) | MCC |
|---|---|---|---|---|
| Yeast | 91.45 | 99.71 | 37.56 | 0.4389 |
| Human | 93.12 | 99.57 | 39.10 | 0.4421 |
Figure 4. The receiver operating characteristic (ROC) curve of the proposed model on the yeast dataset.
Figure 5. The ROC curve of the proposed model on the human dataset.
Performance of our proposed model and previous methods on the yeast dataset. AUC: area under the curve.
| Model | Acc (%) | TNR (%) | F1-score (%) | MCC | AUC |
|---|---|---|---|---|---|
| SLIPPER | 71.90 | 72.18 | 36.16 | 0.2842 | 0.7723 |
| DXECPPI | 87.46 | 94.93 | 34.89 | 0.2825 | 0.6934 |
| PPIevo | 66.28 | 87.46 | 28.92 | 0.1801 | 0.6728 |
| LocFuse | 66.66 | 68.10 | 27.53 | 0.1577 | 0.7087 |
| CRS | 72.69 | 74.37 | 33.05 | 0.2368 | 0.7115 |
| SPAR | 76.96 | 80.02 | 34.54 | 0.2484 | 0.7455 |
| Proposed model | 91.45 | 99.71 | 37.56 | 0.4389 | |
Performance of our proposed model and previous methods on the human dataset.
| Model | Acc (%) | TNR (%) | F1-score (%) | MCC | AUC |
|---|---|---|---|---|---|
| SLIPPER | 91.10 | 95.06 | | 0.4197 | |
| DXECPPI | 30.90 | 25.83 | 17.28 | 0.0825 | 0.5806 |
| PPIevo | 78.04 | 25.82 | 27.73 | 0.2082 | 0.7329 |
| LocFuse | 80.66 | 80.50 | 27.65 | 0.2026 | 0.7087 |
| CRS | 91.54 | 96.72 | 36.83 | 0.3633 | 0.8196 |
| SPAR | 92.09 | 97.40 | 41.13 | 0.3836 | 0.8229 |
| Proposed model | 93.12 | 99.57 | 39.10 | 0.4421 | 0.8524 |