Yang Li1, Zheng Wang2, Zhu-Hong You3, Li-Ping Li4, Xuegang Hu1.
Abstract
Protein-protein interactions (PPIs) play a crucial role in understanding disease pathogenesis and genetic mechanisms, in guiding drug design, and in other biochemical processes, so the identification of PPIs is of great importance. With the rapid development of high-throughput sequencing technology, a large amount of PPI sequence data has accumulated. Researchers have designed many experimental methods to detect PPIs from these sequence data, and the prediction of PPIs has become a research hotspot in proteomics. However, because traditional experimental methods are both time-consuming and costly, it is difficult to analyze and predict the massive amount of PPI data quickly and accurately. To address these issues, many computational systems based on machine learning have been applied to PPI prediction, improving the overall recognition rate. In this paper, a novel and efficient computational method is presented that implements a protein interaction prediction system using only protein sequence information. First, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was employed to generate a position-specific scoring matrix (PSSM) containing protein evolutionary information from the initial protein sequence. Second, we used a novel feature representation scheme, MatFLDA, to extract the essential information of the PSSM for each protein sequence, and obtained five training and five testing datasets by five-fold cross-validation. Finally, the random ferns (RFs) classifier was employed to infer the interactions among proteins, and a model called MatFLDA_RFs was developed. The proposed MatFLDA_RFs model achieved 95.03% average accuracy on the Yeast dataset and 85.35% average accuracy on the H. pylori dataset, which outperformed other existing computational methods.
The experimental results indicate that the proposed method yields better PPI prediction results, providing an effective tool for the detection of new PPIs and the in-depth study of proteomics. We also developed a web server for the proposed model to predict protein-protein interactions, which is freely accessible online at http://120.77.11.78:5001/webserver/MatFLDA_RFs.
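The random ferns classifier named in the abstract can be illustrated with a minimal sketch. The code below is a generic, simplified random ferns implementation in plain Python, not the authors' MatFLDA_RFs code. The class name `RandomFerns`, its parameters (`n_ferns`, `depth`, `seed`), the rule for drawing thresholds, the uniform class prior, and the toy two-cluster data are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


class RandomFerns:
    """Simplified random ferns classifier (illustrative sketch only)."""

    def __init__(self, n_ferns=30, depth=5, seed=0):
        self.n_ferns = n_ferns  # number of ferns in the ensemble (assumed value)
        self.depth = depth      # number of binary tests per fern (assumed value)
        self.rng = random.Random(seed)

    def _leaf(self, fern, x):
        # Encode the outcomes of the fern's binary tests as one integer index.
        idx = 0
        for feat, thr in fern:
            idx = (idx << 1) | int(x[feat] > thr)
        return idx

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_feat = len(X[0])
        n_leaves = 2 ** self.depth
        self.ferns, self.log_probs = [], []
        for _ in range(self.n_ferns):
            # Each test compares one random feature with a threshold taken
            # from a randomly chosen training sample (an assumed rule).
            fern = []
            for _ in range(self.depth):
                feat = self.rng.randrange(n_feat)
                fern.append((feat, self.rng.choice(X)[feat]))
            counts = {c: defaultdict(int) for c in self.classes}
            totals = defaultdict(int)
            for x, label in zip(X, y):
                counts[label][self._leaf(fern, x)] += 1
                totals[label] += 1
            # Class-conditional leaf probabilities with add-one smoothing.
            table = {
                c: [math.log((counts[c][leaf] + 1) / (totals[c] + n_leaves))
                    for leaf in range(n_leaves)]
                for c in self.classes
            }
            self.ferns.append(fern)
            self.log_probs.append(table)
        return self

    def predict(self, X):
        # Ferns are treated as independent: sum their log-probabilities per
        # class (uniform class prior assumed) and pick the best class.
        preds = []
        for x in X:
            scores = {
                c: sum(t[c][self._leaf(f, x)]
                       for f, t in zip(self.ferns, self.log_probs))
                for c in self.classes
            }
            preds.append(max(scores, key=scores.get))
        return preds


# Toy data: two well-separated 2-D Gaussian clusters (hypothetical, not PPI data).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
X += [[data_rng.gauss(4, 1), data_rng.gauss(4, 1)] for _ in range(100)]
y = [0] * 100 + [1] * 100

model = RandomFerns().fit(X, y)
accuracy = sum(p == t for p, t in zip(model.predict(X), y)) / len(y)
print(f"training accuracy on toy data: {accuracy:.3f}")
```

In a setting like the one described above, each row of `X` would be a feature vector for a protein pair and `y` would mark interacting versus non-interacting pairs; the sketch makes no claim about the feature dimensions or hyperparameters used in the paper.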
Year: 2022 PMID: 35242211 PMCID: PMC8888042 DOI: 10.1155/2022/7191684
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1. Flow chart of MatFLDA feature extraction for each protein pair.
Figure 2. The flow of the proposed scheme.
Five-fold cross-validation results of the proposed method on the Yeast PPI dataset.
| Testing set | ACC (%) | PE (%) | SN (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|
| 1 | 95.26 | 99.41 | 91.06 | 90.94 | 94.79 |
| 2 | 94.99 | 99.33 | 90.85 | 90.47 | 93.44 |
| 3 | 94.81 | 98.81 | 90.55 | 90.12 | 94.11 |
| 4 | 94.77 | 99.21 | 90.27 | 90.05 | 94.00 |
| 5 | 95.31 | 98.92 | 91.49 | 91.02 | 94.99 |
| Average | 95.03 ± 0.25 | 99.14 ± 0.26 | 90.84 ± 0.47 | 90.52 ± 0.45 | 94.27 ± 0.63 |
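The "Average" row can be checked from the per-fold values. The short sketch below recomputes the accuracy entry from the five folds in the table above; the use of the sample standard deviation (n − 1 denominator) is an assumption, chosen because it matches the reported 0.25, and the helper name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# Per-fold accuracies (%) copied from the Yeast table above.
yeast_acc = [95.26, 94.99, 94.81, 94.77, 95.31]


def summarize(values):
    """Return (mean, sample standard deviation), each rounded to two decimals."""
    return round(mean(values), 2), round(stdev(values), 2)


acc_mean, acc_std = summarize(yeast_acc)
print(f"ACC = {acc_mean:.2f} ± {acc_std:.2f}")  # prints "ACC = 95.03 ± 0.25"
```

The population standard deviation (`statistics.pstdev`) would give 0.22 for the same values, so the sample form appears to be the one used in the table.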
Five-fold cross-validation results of the proposed method on the H. pylori PPI dataset.
| Testing set | ACC (%) | PE (%) | SN (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|
| 1 | 85.76 | 79.30 | 95.77 | 75.19 | 94.16 |
| 2 | 85.59 | 79.15 | 96.56 | 74.76 | 93.63 |
| 3 | 85.59 | 79.27 | 94.20 | 75.11 | 94.28 |
| 4 | 85.59 | 80.44 | 95.74 | 74.58 | 93.78 |
| 5 | 84.22 | 78.17 | 96.35 | 72.43 | 94.78 |
| Average | 85.35 ± 0.64 | 79.27 ± 0.81 | 95.72 ± 0.92 | 74.41 ± 1.14 | 94.12 ± 0.45 |
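The column headers in these tables are not defined in this excerpt. Assuming ACC is accuracy, PE is precision, SN is sensitivity (recall), and MCC is the Matthews correlation coefficient in their standard forms, they can be computed from confusion-matrix counts as sketched below. The function name `ppi_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def ppi_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    pe = tp / (tp + fp)                    # precision (assumed meaning of PE)
    sn = tp / (tp + fn)                    # sensitivity / recall
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                      # Matthews correlation coefficient
    return {"ACC": acc, "PE": pe, "SN": sn, "MCC": mcc}


# Hypothetical counts for one test fold (not from the paper).
metrics = ppi_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Under these definitions, high SN with lower PE, as in the H. pylori rows, corresponds to few missed interactions but more false positives, while the Yeast rows show the opposite pattern.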
Figure 3. ROC curves obtained using the proposed method on the Yeast dataset.
Figure 4. ROC curves obtained using the proposed method on the H. pylori dataset.
Five-fold cross-validation results of different classifiers on the Yeast dataset.
| Classifier | Testing set | ACC (%) | PE (%) | SN (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|---|
| SVM | 1 | 81.63 | 84.29 | 77.73 | 69.91 | 87.06 |
| | 2 | 80.02 | 83.86 | 75.61 | 67.92 | 86.23 |
| | 3 | 79.44 | 80.79 | 76.39 | 67.25 | 84.55 |
| | 4 | 80.20 | 83.28 | 75.63 | 68.11 | 84.74 |
| | 5 | 80.69 | 82.83 | 76.83 | 68.72 | 86.34 |
| | Average | 80.39 ± 0.82 | 83.01 ± 1.36 | 76.44 ± 0.89 | 68.38 ± 1.00 | 85.78 ± 1.09 |
| RFs | 1 | 95.26 | 99.41 | 91.06 | 90.94 | 94.79 |
| | 2 | 94.99 | 99.33 | 90.85 | 90.47 | 93.44 |
| | 3 | 94.81 | 98.81 | 90.55 | 90.12 | 94.11 |
| | 4 | 94.77 | 99.21 | 90.27 | 90.05 | 94.00 |
| | 5 | 95.31 | 98.92 | 91.49 | 91.02 | 94.99 |
| | Average | 95.03 ± 0.25 | 99.14 ± 0.26 | 90.84 ± 0.47 | 90.52 ± 0.45 | 94.27 ± 0.63 |
| Random Forest | Average | 95.48 ± 0.29 | 97.71 ± 0.38 | 93.14 ± 0.71 | 91.35 ± 0.53 | 95.48 ± 0.28 |
| XGBoost | Average | 94.08 ± 1.08 | 96.43 ± 0.92 | 91.54 ± 1.52 | 88.86 ± 1.91 | 98.59 ± 0.34 |
Five-fold cross-validation results of different classifiers on the H. pylori dataset.
| Classifier | Testing set | ACC (%) | PE (%) | SN (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|---|
| SVM | 1 | 82.85 | 81.72 | 83.45 | 71.57 | 89.26 |
| | 2 | 82.33 | 80.52 | 85.22 | 70.86 | 89.87 |
| | 3 | 79.42 | 76.17 | 82.25 | 67.25 | 86.20 |
| | 4 | 82.33 | 83.22 | 82.95 | 70.85 | 89.16 |
| | 5 | 83.53 | 84.75 | 83.06 | 72.47 | 90.22 |
| | Average | 82.09 ± 1.57 | 81.28 ± 3.26 | 83.39 ± 1.12 | 70.60 ± 1.99 | 88.94 ± 1.60 |
| RFs | 1 | 85.76 | 79.30 | 95.77 | 75.19 | 94.16 |
| | 2 | 85.59 | 79.15 | 96.56 | 74.76 | 93.63 |
| | 3 | 85.59 | 79.27 | 94.20 | 75.11 | 94.28 |
| | 4 | 85.59 | 80.44 | 95.74 | 74.58 | 93.78 |
| | 5 | 84.22 | 78.17 | 96.35 | 72.43 | 94.78 |
| | Average | 85.35 ± 0.64 | 79.27 ± 0.81 | 95.72 ± 0.92 | 74.41 ± 1.14 | 94.12 ± 0.45 |
| Random Forest | Average | 87.27 ± 0.82 | 85.90 ± 0.72 | 89.09 ± 2.45 | 77.73 ± 1.21 | 93.28 ± 0.69 |
| XGBoost | Average | 85.11 ± 1.22 | 84.28 ± 3.10 | 86.49 ± 3.25 | 74.64 ± 1.72 | 91.59 ± 0.82 |
Figure 5. ROC curves obtained using the SVM method on the Yeast dataset.
Figure 6. ROC curves obtained using the SVM method on the H. pylori dataset.
The prediction ability of the different methods on the Yeast dataset.
| Related work | Method | ACC (%) | SN (%) | PE (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|---|
| Guo et al.'s work | AC | 87.36 ± 1.38 | 87.30 ± 4.68 | 87.82 ± 4.33 | N/A | N/A |
| | ACC | 89.33 ± 2.67 | 89.93 ± 3.68 | 88.87 ± 6.16 | N/A | N/A |
| Yang et al.'s work | Cod4 + KNN | 86.15 ± 1.17 | 81.03 ± 1.74 | 90.24 ± 1.34 | N/A | N/A |
| Zhou et al.'s work | SVM + LD | 88.56 ± 0.33 | 87.37 ± 0.22 | 89.50 ± 0.60 | 77.15 ± 0.68 | 95.07 ± 0.39 |
| You et al.'s work | MCD + SVM | 91.36 ± 0.36 | 90.67 ± 0.69 | 91.94 ± 0.62 | 84.21 ± 0.59 | 97.07 ± 0.12 |
| You et al.'s work | LRA + RF | 94.14 ± 1.8 | 91.22 ± 1.6 | 97.10 ± 2.1 | 88.96 ± 2.6 | 94.20 ± 1.7 |
| Du et al.'s work | DeepPPI | 94.43 ± 0.30 | 92.06 ± 0.36 | 96.65 ± 0.59 | 88.97 ± 0.62 | N/A |
| Wong et al.'s work | PR-LPQ + RF | 93.92 ± 0.36 | 91.10 ± 0.31 | 96.45 ± 0.45 | 88.56 ± 0.63 | N/A |
| Proposed method | MatFLDA_RFs | 95.03 ± 0.25 | 90.84 ± 0.47 | 99.14 ± 0.26 | 90.52 ± 0.45 | 94.27 ± 0.63 |
Note: N/A means not available.
The prediction ability of the different methods on the H. pylori PPI dataset.
| Related work | Method | ACC (%) | SN (%) | PE (%) | MCC (%) |
|---|---|---|---|---|---|
| Martin et al.'s work | Signature products + SVM | 83.40 | 79.90 | 85.70 | N/A |
| You et al.'s work | MCD + SVM | 84.91 | 83.24 | 86.12 | 74.40 |
| Nanni's work | WSR | 83.70 | 79.00 | 87.00 | N/A |
| Bock and Gough's work | Phylogenetic Bootstrap | 75.80 | 69.80 | 80.20 | N/A |
| Nanni's work | LDC | 83.00 | 80.60 | 85.10 | N/A |
| Shi et al.'s work | Boosting | 79.52 | 80.37 | 81.69 | 70.64 |
| Proposed method | MatFLDA_RFs | 85.35 | 95.72 | 79.27 | 74.41 |
Note: N/A means not available.