Muhao Chen1, Chelsea J-T Ju1, Guangyu Zhou1, Xuelu Chen1, Tianran Zhang2, Kai-Wei Chang1, Carlo Zaniolo1, Wei Wang1.
Abstract
MOTIVATION: Sequence-based protein-protein interaction (PPI) prediction is a fundamental problem in computational biology. To address it, extensive research effort has gone into extracting predefined features from the sequences, on which statistical algorithms are then trained to classify PPIs. However, such explicit features are usually costly to extract and typically offer limited coverage of the PPI information.
Year: 2019 PMID: 31510705 PMCID: PMC6681469 DOI: 10.1093/bioinformatics/btz328
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The overall learning architecture of our framework
Fig. 2. The structure of our residual RCNN encoder is shown on the right, and the RCNN unit is shown on the left. Each RCNN unit contains a convolution-pooling layer followed by a bidirectional residual GRU
Evaluation of binary PPI prediction on the Yeast dataset based on 5-fold cross-validation. We report the mean and SD for the test sets
| Methods | Accuracy (%) | Precision (%) | Sensitivity (%) | Specificity (%) | F1-score (%) | MCC (%) |
|---|---|---|---|---|---|---|
| SVM-AC | 87.35 ± 1.38 | 87.82 ± 4.84 | 87.30 ± 5.23 | 87.41 ± 6.33 | 87.34 ± 1.33 | 75.09 ± 2.51 |
| kNN-CTD | 86.15 ± 1.17 | 90.24 ± 1.34 | 81.03 ± 1.74 | NA | 85.39 ± 1.51 | NA |
| EELM-PCA | 86.99 ± 0.29 | 87.59 ± 0.32 | 86.15 ± 0.43 | NA | 86.86 ± 0.37 | 77.36 ± 0.44 |
| SVM-MCD | 91.36 ± 0.4 | 91.94 ± 0.69 | 90.67 ± 0.77 | NA | 91.3 ± 0.73 | 84.21 ± 0.66 |
| MLP | 94.43 ± 0.3 | 96.65 ± 0.59 | 92.06 ± 0.36 | NA | 94.3 ± 0.45 | 88.97 ± 0.62 |
| RF-LPQ | 93.92 ± 0.36 | 96.45 ± 0.45 | 91.10 ± 0.31 | NA | 93.7 ± 0.37 | 88.56 ± 0.63 |
| SAE | 67.17 ± 0.62 | 66.90 ± 1.42 | 68.06 ± 2.50 | 66.30 ± 2.27 | 67.44 ± 1.08 | 34.39 ± 1.25 |
| DNN-PPI | 76.61 ± 0.51 | 75.1 ± 0.66 | 79.63 ± 1.34 | 73.59 ± 1.28 | 77.29 ± 0.66 | 53.32 ± 1.05 |
| DPPI | 94.55 | 96.68 | 92.24 | NA | 94.41 | NA |
| SRGRU | 93.77 ± 0.84 | 94.60 ± 0.64 | 92.85 ± 1.58 | 94.69 ± 0.81 | 93.71 ± 0.85 | 87.56 ± 1.67 |
| SCNN | 95.03 ± 0.47 | 95.51 ± 0.77 | 94.51 ± 1.27 | 95.55 ± 0.77 | 95.00 ± 0.50 | 90.08 ± 0.93 |
| | | | | 97.00 ± 0.67 | | |
Each boldfaced number indicates the best of the corresponding metric.
NA, not available from the original paper.
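The six metrics reported in the table can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' evaluation code; the function names `binary_metrics` and `mean_sd` are hypothetical, and the use of the sample standard deviation across folds is an assumption, since the source does not say which estimator was used.

```python
import math
from statistics import mean, stdev


def binary_metrics(tp, fp, tn, fn):
    """Return the six table metrics (in %) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    raw = {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }
    return {name: 100.0 * value for name, value in raw.items()}


def mean_sd(per_fold_values):
    """Mean and sample SD over folds (assumption: n - 1 in the denominator)."""
    return mean(per_fold_values), stdev(per_fold_values)


if __name__ == "__main__":
    # Hypothetical counts for one test fold, for illustration only.
    print(binary_metrics(tp=40, fp=10, tn=40, fn=10))
```

Calling `binary_metrics` once per test fold and passing each metric's per-fold values to `mean_sd` would give entries in the "mean ± SD" form used above.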
Run-time of training embeddings and different prediction tasks
| Task | Embeddings | Binary | Multi-class | Multi-class | Regression |
|---|---|---|---|---|---|
| Dataset | SHS148k | Yeast | SHS27k | SHS148k | SKEMPI |
| Sample size | 8000 | 11 188 | 26 945 | 148 051 | 2 950 |
| Training time | 8 s | 2.5 min | 15.8 min | 138.3 min | 12.5 min |
Statistical assessment (t-test; two-tailed) on the accuracy of binary PPI prediction
| Methods | SRGRU | SCNN | |
|---|---|---|---|
| SVM-AC | 9.69E-05 | 1.22E-04 | 9.69E-05 |
| kNN-CTD | 1.03E-05 | 2.23E-05 | 2.84E-05 |
| EELM-PCA | 2.33E-05 | 3.94E-08 | 2.43E-10 |
| SVM-MCD | 1.67E-03 | 2.60E-06 | 1.35E-07 |
| MLP | 1.71E-01 | 5.29E-02 | 1.12E-06 |
| RF-LPQ | 7.28E-01 | 4.10E-03 | 1.75E-06 |
| SAE | 4.27E-10 | 1.78E-10 | 4.19E-09 |
| DNN-PPI | 1.62E-08 | 2.27E-10 | 2.70E-09 |
| SRGRU | NA | 2.87E-02 | 6.60E-04 |
| SCNN | 2.87E-02 | NA | 1.80E-04 |
Note: The statistically significant differences are highlighted in red.
NA, not available.
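For readers who want to run a comparable significance check, the sketch below computes a two-tailed t-test on two sets of per-fold accuracies using only the Python standard library. It assumes Welch's unequal-variance two-sample test; the source only says "t-test; two-tailed", so the exact variant the authors used is not known. All function names and the example accuracies are hypothetical. The p-value comes from the regularized incomplete beta function, evaluated with a standard continued-fraction expansion.

```python
import math
from statistics import mean, variance


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Each iteration applies the even term, then the odd term.
        for aa in (
            m * (b - m) * x / ((qam + m2) * (a + m2)),
            -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2)),
        ):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log1p(-x)
    )
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def two_tailed_p(t, df):
    """Two-tailed p-value of a t statistic with df degrees of freedom."""
    return _betai(df / 2.0, 0.5, df / (df + t * t))


def welch_t_test(a, b):
    """Return (t, df, p) for Welch's test; needs non-zero sample variance."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, two_tailed_p(t, df)


if __name__ == "__main__":
    # Hypothetical per-fold accuracies (%) of two methods, not from the paper.
    method_a = [96.8, 97.2, 97.0, 97.4, 96.9]
    method_b = [94.1, 94.6, 94.3, 94.9, 94.2]
    print(welch_t_test(method_a, method_b))
```

Where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and is the more practical choice.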
Evaluation of binary PPI prediction on variants of multi-species (C. elegans, D. melanogaster and E. coli) dataset
| Seq. identity | # of proteins | Pos. pairs | Neg. pairs | Accuracy (%) | F1-score (%) |
|---|---|---|---|---|---|
| Any | 11 529 | 32 959 | 32 959 | 98.19 | 98.17 |
| <0.40 | 9739 | 25 916 | 22 012 | 98.29 | 98.28 |
| <0.25 | 7790 | 19 458 | 15 827 | 97.91 | 98.08 |
| <0.10 | 5769 | 12 641 | 9819 | 97.54 | 97.79 |
| <0.01 | 5171 | 10 747 | 8065 | 97.51 | 97.80 |
Accuracy (%) and fold changes over zero rule for PPI interaction type prediction on two STRING datasets based on 10-fold cross-validation
| Features | N/A | N/A | AC | AC | AC | AC | AC | CTD | CTD | CTD | CTD | CTD | Embedded raw seqs | Embedded raw seqs | Embedded raw seqs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Methods | Rand | Zero rule | SVM | RF | AdaBoost | kNN | Logistic | SVM | RF | AdaBoost | kNN | Logistic | SCNN | SRGRU | |
| SHS27k | 14.28 | 16.70 | 33.17 | 44.82 | 28.67 | 35.44 | 25.47 | 35.56 | 45.76 | 31.81 | 35.56 | 30.57 | 55.54 | 51.06 | |
| (fold×) | — | 1.00× | 1.99× | 2.68× | 1.72× | 2.12× | 1.52× | 2.13× | 2.74× | 1.90× | 2.13× | 1.83× | 3.33× | 3.06× | |
| SHS148k | 14.28 | 16.21 | 28.17 | 36.01 | 27.87 | 33.81 | 24.96 | 31.37 | 36.65 | 29.67 | 33.13 | 26.96 | 55.29 | 54.05 | |
| (fold×) | — | 1.00× | 1.74× | 2.22× | 1.72× | 2.09× | 1.54× | 1.94× | 2.26× | 1.83× | 2.04× | 1.66× | 3.41× | 3.33× | |
Each boldfaced number indicates the best of the corresponding metric.
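The fold-change rows can be reproduced by dividing each accuracy by the zero-rule accuracy of the same dataset. The sketch below assumes that "zero rule" means always predicting the most frequent class, which is the usual meaning of the term; the function names are hypothetical.

```python
from collections import Counter


def zero_rule_accuracy(labels):
    """Accuracy (%) of always predicting the most frequent class."""
    counts = Counter(labels)
    return 100.0 * max(counts.values()) / len(labels)


def fold_change(accuracy, baseline_accuracy):
    """How many times better a method is than the baseline."""
    return accuracy / baseline_accuracy


if __name__ == "__main__":
    # Accuracies taken from the SHS27k row of the table above.
    zero_rule = 16.70
    for name, acc in [("SVM (AC)", 33.17), ("SCNN", 55.54), ("SRGRU", 51.06)]:
        print(f"{name}: {fold_change(acc, zero_rule):.2f}x")
```

With the SHS27k values this gives 1.99, 3.33 and 3.06 after rounding to two decimals, matching the corresponding (fold×) entries.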
Results for binding affinity prediction on the SKEMPI dataset
| Features | AC | AC | AC | AC | CTD | CTD | CTD | CTD | Embedded raw seqs | Embedded raw seqs | Embedded raw seqs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Methods | BR | SVM | RF | AdaBoost | BR | SVM | RF | AdaBoost | SCNN | SRGRU | |
| | 1.70 | 2.20 | 1.77 | 1.98 | 1.86 | 1.84 | 1.49 | 1.84 | 0.87 | 0.95 | |
| | 9.56 | 11.81 | 9.81 | 11.15 | 10.20 | 11.04 | 9.06 | 10.69 | 6.49 | 7.08 | |
| | 0.564 | 0.353 | 0.546 | 0.451 | 0.501 | 0.501 | 0.640 | 0.508 | 0.831 | 0.812 | |
Note: Each measurement is an average of the test sets over 10-fold cross-validation.
Each boldfaced number indicates the best of the corresponding metric.
Fig. 3. Mutation effects on structure and binding affinity. The blue entity is Subtilisin BPN’ precursor (Chain E), and the red entity is Chymotrypsin inhibitor (Chain I). The mutation is highlighted in yellow. The wild-type (1TM1) and mutant (1TO1) complexes are retrieved from PDB
Comparison of amino acid representations based on binary prediction
| | | | | One-hot |
|---|---|---|---|---|
| Dimension | 12 | 5 | 7 | 20 |
| Accuracy | | 96.67 | 96.03 | 96.11 |
| Precision | | 96.35 | 95.91 | 96.34 |
| F1-score | | 96.51 | 96.08 | 96.10 |
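To make the 20-dimensional one-hot baseline concrete, the sketch below encodes a protein sequence as one indicator vector per residue over the 20 standard amino acids. It is an illustration only: the alphabet ordering, the function name `one_hot_encode` and the all-zero vector for non-standard residues are assumptions, not details taken from the source.

```python
# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode a protein sequence as a list of 20-dimensional 0/1 vectors.

    Residues outside the standard alphabet (e.g. 'X') map to an all-zero
    vector; this handling is an assumption made for the sketch.
    """
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        position = AA_INDEX.get(residue)
        if position is not None:
            vector[position] = 1
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    # Hypothetical short peptide, for illustration only.
    for residue, vector in zip("MKV", one_hot_encode("MKV")):
        print(residue, vector)
```

Each residue then occupies 20 dimensions, compared with the 12-, 5- and 7-dimensional representations in the table.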