Nan Zhao1, Maji Zhuo1, Kun Tian1, Xinqi Gong2,3,4.
Abstract
Predicting protein-protein interactions and non-interactions are two distinct and important aspects of multi-body structure prediction, and both provide vital information about protein function. Computational methods have recently been developed to complement experimental methods, but they still cannot effectively detect real non-interacting protein pairs. We propose a gene sequence-based method, named NVDT (Natural Vector combine with Dinucleotide and Triplet nucleotide), for predicting both interaction and non-interaction. For protein-protein non-interactions (PPNIs), the proposed method obtained accuracies of 86.23% for Homo sapiens and 85.34% for Mus musculus, and it performed well on three types of non-interaction networks. For protein-protein interactions (PPIs), we obtained accuracies of 99.20, 94.94, 98.56, 95.41, and 94.83% for Saccharomyces cerevisiae, Drosophila melanogaster, Helicobacter pylori, Homo sapiens, and Mus musculus, respectively. Furthermore, NVDT outperformed established sequence-based methods and achieved high accuracy when predicting cross-species interactions. NVDT is expected to be an effective approach for predicting PPIs and PPNIs.
Year: 2022 PMID: 35780196 PMCID: PMC9250521 DOI: 10.1038/s42003-022-03617-0
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Prediction results by using two classifiers on H. sapiens and M. musculus datasets.
| Species | Test set | Classifier | Acc. (%) | Pre. (%) | Sen. (%) | MCC (%) | F-score (%) | AUC |
|---|---|---|---|---|---|---|---|---|
| H. sapiens | real dataset | SVM | 80.92 | 78.83 | 84.54 | 62.00 | 81.59 | 0.8707 |
| H. sapiens | real dataset | RF | 86.23 | 84.09 | 89.37 | 72.61 | 86.65 | 0.8623 |
| H. sapiens | constructed dataset | SVM | 95.41 | 91.59 | 100.00 | 91.21 | 95.61 | 0.9283 |
| H. sapiens | constructed dataset | RF | 95.41 | 91.59 | 100.00 | 91.21 | 95.61 | 0.9541 |
| M. musculus | real dataset | SVM | 80.17 | 75.36 | 89.66 | 61.46 | 81.89 | 0.8249 |
| M. musculus | real dataset | RF | 85.34 | 90.20 | 79.31 | 71.21 | 84.40 | 0.8534 |
| M. musculus | constructed dataset | SVM | 94.83 | 96.43 | 93.10 | 89.71 | 94.74 | 0.9643 |
| M. musculus | constructed dataset | RF | 94.83 | 100.00 | 89.66 | 90.14 | 94.55 | 0.9483 |
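The metric columns above (Acc., Pre., Sen., MCC, F-score) can all be derived from the four confusion-matrix counts. The following is a minimal sketch using the standard definitions; the function name `binary_metrics` and the example counts are hypothetical. The counts were chosen only because they are consistent with the H. sapiens constructed-dataset row, and the actual test-set sizes are not stated in this record.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return standard binary-classification metrics as fractions in [0, 1]."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    pre = tp / (tp + fp) if tp + fp else 0.0
    sen = tp / (tp + fn) if tp + fn else 0.0
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # F-score written directly in terms of counts (harmonic mean of pre and sen).
    f_score = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"acc": acc, "pre": pre, "sen": sen, "mcc": mcc, "f_score": f_score}


# Hypothetical counts (an assumption, not reported data): a balanced test set
# with no false negatives and 19 false positives.
metrics = binary_metrics(tp=207, fp=19, tn=188, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}")
```

Note that the table reports MCC as a percentage, so the fraction returned here is multiplied by 100 before comparison.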
Fig. 1 PPNI network prediction results.
a A one-core network involving P62879. b A multiple-core network involving the Q8TBX8-O75175-P31150-Q16828-Q8TAU0-Q9H6S3 pathway. c A crossing network. The core and satellite proteins are represented by indigo blue circles and light blue circles, respectively. Dotted lines connecting two proteins are divided into four classes: gray, predicted correctly; red, predicted falsely; green, re-predicted correctly after adding 40% non-interactions; blue, re-predicted falsely after adding 40% non-interactions.
Five-fold cross-validation results on the constructed dataset.
| Test set | Acc. (%) | Pre. (%) | Sen. (%) | MCC (%) | F-score (%) | AUC |
|---|---|---|---|---|---|---|
| | 95.40 ± 1.45 | 91.69 ± 2.39 | 99.90 ± 0.22 | 91.19 ± 2.68 | 95.61 ± 1.33 | 0.9647 ± 0.0360 |
| | 94.83 ± 2.69 | 96.43 ± 2.64 | 93.10 ± 3.92 | 89.71 ± 5.31 | 94.74 ± 2.77 | 0.9430 ± 0.0184 |
| | 98.28 ± 0.33 | 98.87 ± 0.47 | 97.68 ± 0.33 | 96.58 ± 0.67 | 98.27 ± 0.33 | 0.9963 ± 0.0009 |
| | 93.22 ± 1.53 | 95.33 ± 1.62 | 90.92 ± 3.03 | 86.57 ± 2.99 | 93.04 ± 1.65 | 0.9656 ± 0.0106 |
| | 94.63 ± 2.18 | 97.56 ± 1.64 | 91.56 ± 4.02 | 89.49 ± 4.19 | 94.43 ± 2.33 | 0.9822 ± 0.0088 |
Note: The values in the table are average ± standard deviation.
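As a rough illustration of how "average ± standard deviation" entries arise from five-fold cross-validation, the sketch below splits sample indices into five folds and summarizes per-fold scores. The helper names, the fold accuracies, and the use of the sample standard deviation are assumptions made for illustration. This record does not say whether the sample or population form was used, nor how the folds were drawn.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices once and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Each fold serves once as the validation set; the rest form the training set.
folds = k_fold_indices(n_samples=23, k=5)
for held_out in range(len(folds)):
    train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    valid = folds[held_out]
    # A classifier would be fitted on `train` and scored on `valid` here.

# Hypothetical per-fold accuracies, not values from the paper.
fold_accuracies = [0.94, 0.96, 0.95, 0.97, 0.93]
mean_acc, sd_acc = summarize(fold_accuracies)
print(f"{mean_acc * 100:.2f} ± {sd_acc * 100:.2f}")
```

In practice a stratified split, which keeps the ratio of interacting to non-interacting pairs equal across folds, is a common choice for balanced datasets like these.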
Fig. 2 Prediction performance based on SVM with NV, NVD, and NVDT on S. cerevisiae, D. melanogaster and H. pylori constructed datasets.
The abscissa shows the prediction metrics and the ordinate shows the prediction performance.
Prediction performance based on distinct feature combination methods on H. sapiens and M. musculus datasets.
| Datasets | Cod1 | Cod2 | Cod3 | Cod4 | Cod5 |
|---|---|---|---|---|---|
| | 79.95 | 88.89 | 90.82 | 89.37 | 92.51 |
| | 70.77 | 78.26 | 78.27 | 74.15 | 80.92 |
| | 86.21 | 89.66 | 90.52 | 88.79 | 94.83 |
| | 75.00 | 69.83 | 78.45 | 66.38 | 80.17 |
Note: The values in the table are the accuracy (%) of the independent test set.
Prediction results of different methods on three constructed independent test datasets.
| Dataset | Method | Acc. (%) | Pre. (%) | Sen. (%) | MCC (%) | F-score (%) |
|---|---|---|---|---|---|---|
| S. cerevisiae | Natural vector difference | 78.43 | 83.80 | 70.49 | 57.60 | 76.57 |
| S. cerevisiae | Codon frequency difference | 91.90 | 90.38 | 93.78 | 83.86 | 92.05 |
| S. cerevisiae | our method (SVM-NVDT) | 99.20 | 99.35 | 99.03 | 98.39 | 99.19 |
| D. melanogaster | Natural vector difference | 79.78 | 84.87 | 72.47 | 60.20 | 78.18 |
| D. melanogaster | Codon frequency difference | 89.04 | 89.71 | 88.20 | 78.10 | 88.95 |
| D. melanogaster | our method (SVM-NVDT) | 94.94 | 97.62 | 92.13 | 90.03 | 94.80 |
| H. pylori | Natural vector difference | 71.60 | 71.18 | 71.19 | 43.21 | 71.49 |
| H. pylori | Codon frequency difference | 72.84 | 79.37 | 61.73 | 46.85 | 69.44 |
| H. pylori | our method (SVM-NVDT) | 98.56 | 98.36 | 98.77 | 97.12 | 98.56 |
Note: Natural vector difference method[58]. Codon frequency difference[57].
Comparison of established methods using the S. cerevisiae dataset.
| Model | Acc. (%) | Pre. (%) | Sen. (%) | MCC (%) |
|---|---|---|---|---|
| ACC (Guo, et al., 2008) | 89.33 | 88.87 | 89.93 | N/A |
| AC (Guo, et al., 2008) | 87.36 | 87.82 | 87.30 | N/A |
| Cod1 (Yang, et al., 2010) | 75.08 | 74.75 | 75.81 | N/A |
| Cod2 (Yang, et al., 2010) | 80.04 | 82.17 | 76.77 | N/A |
| Cod3 (Yang, et al., 2010) | 80.41 | 81.66 | 78.14 | N/A |
| Cod4 (Yang, et al., 2010) | 86.15 | 90.24 | 81.03 | N/A |
| SVM+LD (Zhou, et al., 2011) | 88.56 | 89.50 | 87.37 | 77.15 |
| RF+PR+LPQ (Wong, et al., 2015) | 93.80 | 96.66 | 90.64 | 88.35 |
| PCVMZM (Wang, et al., 2017) | 94.48 | 93.92 | 95.13 | 89.58 |
| DeepPPI (Du, et al., 2017) | 94.43 | 96.65 | 92.06 | 88.97 |
| DPPI (Hashemifar, et al., 2018) | 94.55 | 96.68 | 92.24 | N/A |
| LightGBM (Chen, et al., 2019) | 95.07 | 97.82 | 92.21 | 90.30 |
| PIPR (Chen, et al., 2019) | 97.09 | 97.00 | 97.17 | 95.63 |
| StackPPI (Cheng, et al., 2020) | 94.64 | 96.33 | 92.81 | 89.34 |
| TAGPPI (Song, et al., 2022) | 97.81 | 98.10 | 98.26 | 95.63 |
| Our model (SVM-NVDT) | 99.20 | 99.35 | 99.03 | 98.39 |
Note: N/A means not available.
Comparison of existing methods using the H. pylori dataset.
| Model | Acc. (%) | Pre. (%) | Sen. (%) | MCC (%) |
|---|---|---|---|---|
| HKNN (Nanni, 2005) | 84.00 | 84.00 | 86.00 | N/A |
| Signature products (Martin, et al., 2005) | 83.40 | 85.70 | 79.90 | N/A |
| Ensemble of HKNN (Nanni and Lumini, 2006) | 86.60 | 85.00 | 86.70 | N/A |
| Boosting (Shi, et al., 2010) | 79.52 | 81.69 | 80.37 | 70.64 |
| Ensemble ELM (You, et al., 2013) | 87.50 | 86.15 | 88.95 | 78.13 |
| MCD-SVM (You, et al., 2014) | 84.91 | 86.12 | 83.24 | 74.40 |
| Phylogenetic bootstrap (Bock J R et al., 2015) | 75.80 | 80.20 | 69.80 | N/A |
| RF+PR+LPQ (Wong, et al., 2015) | 89.47 | 89.63 | 89.18 | 81.16 |
| PCVMZM (Wang, et al., 2017) | 91.25 | 90.06 | 92.05 | 84.04 |
| DeepPPI (Du, et al., 2017) | 86.23 | 84.32 | 89.44 | 72.63 |
| Weighted Skip-sequential (Goktepe and Kodaz, 2018) | 89.15 | 87.29 | 88.13 | 77.21 |
| LightGBM (Chen, et al., 2019) | 89.03 | 88.36 | 89.99 | 78.14 |
| StackPPI (Cheng, et al., 2020) | 89.72 | 90.37 | 87.93 | 78.59 |
| Our model (SVM-NVDT) | 98.56 | 98.36 | 98.77 | 97.12 |
Note: N/A means not available.
Prediction results for independent datasets.
| Test species | No. of Test pairs | Acc. (%) |
|---|---|---|
| | 1070 | 99.35 |
| | 1458 | 94.51 |
| | 1045 | 98.56 |
| | 336 | 98.50 |
| | 2143 | 89.64 |
Fig. 3 Workflow of our computational pipeline to predict protein-protein interaction and non-interaction.
The workflow covers the whole research process: dataset collection, feature extraction, the feature-selective classifier, and result analysis.
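To make the feature-extraction step more concrete, here is a minimal sketch of one plausible reading of the NVDT name: a 12-dimensional natural vector (count, mean position, and normalized second central moment of each nucleotide) concatenated with dinucleotide and trinucleotide frequencies. This record does not give the exact feature definitions, so the function names, the use of overlapping windows, the frequency normalization, and the resulting 92-dimensional layout are all assumptions. How the two genes of a pair are combined is also not specified here, and the sketch does not attempt it.

```python
from itertools import product

BASES = "ACGT"


def natural_vector(seq):
    """12-dim natural vector: counts, mean positions, normalized 2nd moments."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in BASES:
        # Positions are 1-based, following the usual natural-vector convention.
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def kmer_frequencies(seq, k):
    """Frequencies of all 4**k k-mers over overlapping windows (assumption)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous symbols
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def nvdt_like_features(seq):
    """Concatenate natural vector, dinucleotide and trinucleotide features."""
    return natural_vector(seq) + kmer_frequencies(seq, 2) + kmer_frequencies(seq, 3)


features = nvdt_like_features("ATGGCATTAG")
print(len(features))
```

The resulting per-gene vectors would then be combined for each gene pair and passed to a classifier such as the SVM or RF models reported in the tables.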