Yijie Ding, Jijun Tang, Fei Guo.
Abstract
Identification of protein-protein interactions (PPIs) is a difficult and important problem in biology. Since experimental methods for detecting PPIs are both expensive and time-consuming, many computational methods have been developed to predict PPIs and interaction networks, which can be used to complement experimental approaches. However, these methods have limitations: they require a large number of homologous proteins or a large body of literature before they can be applied. In this paper, we propose a novel matrix-based protein sequence representation approach to predict PPIs, using an ensemble learning method for classification. We construct the matrix of Amino Acid Contact (AAC), based on the statistical analysis of residue-pairing frequencies in a database of 6323 protein-protein complexes. We first represent the protein sequence as a Substitution Matrix Representation (SMR) matrix. Then, the feature vector is extracted by applying the Histogram of Oriented Gradient (HOG) and Singular Value Decomposition (SVD) algorithms to the SMR matrix. Finally, we feed the feature vector into a Random Forest (RF) to classify interacting and non-interacting pairs. Our method is applied to several PPI datasets to evaluate its performance. On the S. cerevisiae dataset, our method achieves 94.83% accuracy and 92.40% sensitivity; compared with existing methods, the accuracy is increased by 0.11 percentage points. On the H. pylori dataset, our method achieves 89.06% accuracy and 88.15% sensitivity, and the accuracy is increased by 0.76 percentage points. On the Human PPI dataset, our method achieves 97.60% accuracy and 96.37% sensitivity, and the accuracy is increased by 1.30 percentage points. In addition, we test our method on a very important PPI network, and it achieves 92.71% accuracy. In the Wnt-related network, the accuracy of our method is increased by 16.67%.
The source code and all datasets are available at https://figshare.com/s/580c11dce13e63cb9a53.
Keywords: amino acid contact; feature extraction; protein sequence; protein–protein interactions; substitution matrix representation
Year: 2016 PMID: 27669239 PMCID: PMC5085656 DOI: 10.3390/ijms17101623
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Performance of the Histogram of Oriented Gradient (HOG) and Singular Value Decomposition (SVD) features on the S. cerevisiae dataset, using a Random Forest (RF) classifier.
| Feature | Classifier | ACC (%) | SN (%) | Spec (%) | PPV (%) | NPV (%) | F1 (%) | MCC (%) |
|---|---|---|---|---|---|---|---|---|
| HOG | RF | 93.86 ± 0.47 | 90.67 ± 0.47 | 97.05 ± 0.74 | 96.86 ± 0.72 | 91.22 ± 0.64 | 93.66 ± 0.40 | 87.90 ± 0.94 |
| SVD | RF | 92.93 ± 0.52 | 90.25 ± 0.70 | 95.59 ± 1.36 | 95.38 ± 1.22 | 90.76 ± 0.42 | 92.74 ± 0.47 | 85.99 ± 1.10 |
| HOG + SVD | RF | 94.83 ± 0.26 | 92.40 ± 0.50 | 97.26 ± 0.31 | 97.10 ± 0.35 | 92.79 ± 0.59 | 94.69 ± 0.24 | 89.77 ± 0.50 |
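The SMR and SVD steps evaluated in the table above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the common SMR definition in which row i of the matrix is the substitution-matrix row of the i-th residue, and it assumes the SVD feature is the vector of singular values. The residue ordering, the function names `smr_matrix` and `svd_features`, the example sequence, and the random placeholder matrix (standing in for AAC or BLOSUM62, whose values are not given here) are all hypothetical.

```python
import numpy as np

# Assumed ordering of the 20 standard residues (hypothetical choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def smr_matrix(sequence, substitution):
    """Return an N x 20 matrix whose i-th row is the substitution-matrix
    row of the i-th residue of the sequence (assumed SMR definition)."""
    rows = [AA_INDEX[aa] for aa in sequence]
    return substitution[rows, :]


def svd_features(matrix):
    """Return the singular values of the matrix, in descending order."""
    return np.linalg.svd(matrix, compute_uv=False)


# Placeholder symmetric 20 x 20 matrix: NOT real AAC or BLOSUM62 values.
rng = np.random.default_rng(0)
raw = rng.normal(size=(20, 20))
placeholder = (raw + raw.T) / 2

sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
smr = smr_matrix(sequence, placeholder)
features = svd_features(smr)
print(smr.shape, features.shape)
```

Because the number of singular values is bounded by the 20 columns, sequences of at least 20 residues all yield a 20-element vector under this assumption, which is convenient for a fixed-size classifier input.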
Five-fold cross-validation results obtained by our proposed method on the S. cerevisiae dataset.
| Testing Set | ACC (%) | SN (%) | Spec (%) | PPV (%) | NPV (%) | F1 (%) | MCC (%) |
|---|---|---|---|---|---|---|---|
| 1 | 94.73 | 92.70 | 96.72 | 96.53 | 93.08 | 94.58 | 89.52 |
| 2 | 95.13 | 92.80 | 97.31 | 97.01 | 93.51 | 94.86 | 90.31 |
| 3 | 95.04 | 92.67 | 97.47 | 97.40 | 92.84 | 94.98 | 90.19 |
| 4 | 94.81 | 92.24 | 97.40 | 97.27 | 92.59 | 94.69 | 89.75 |
| 5 | 94.46 | 91.60 | 97.39 | 97.28 | 91.91 | 94.35 | 89.09 |
| Average | 94.83 ± 0.26 | 92.40 ± 0.50 | 97.26 ± 0.31 | 97.10 ± 0.35 | 92.79 ± 0.59 | 94.69 ± 0.24 | 89.77 ± 0.50 |
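The column headings in these tables are standard binary-classification metrics. The sketch below shows how they are conventionally computed from a confusion matrix, and how the "Average" row can be approximated from the per-fold ACC values listed above. The function name `binary_metrics` and the confusion counts are hypothetical. The source does not say whether the reported spread is a sample or a population standard deviation, so the sample form is used here as an assumption.

```python
import math
import statistics


def binary_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts, as fractions in [0, 1]
    (MCC lies in [-1, 1]); multiply by 100 to compare with the tables."""
    sn = tp / (tp + fn)        # sensitivity (recall)
    ppv = tp / (tp + fp)       # positive predictive value (precision)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SN": sn,
        "Spec": tn / (tn + fp),
        "PPV": ppv,
        "NPV": tn / (tn + fn),
        "F1": 2 * sn * ppv / (sn + ppv),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Hypothetical confusion counts, for illustration only.
example = binary_metrics(tp=90, tn=95, fp=5, fn=10)

# Per-fold ACC (%) values copied from the five-fold table above.
fold_acc = [94.73, 95.13, 95.04, 94.81, 94.46]
mean_acc = statistics.mean(fold_acc)
std_acc = statistics.stdev(fold_acc)  # sample standard deviation (assumption)
print(f"ACC = {mean_acc:.2f} (spread about {std_acc:.2f})")
```

The mean reproduces the 94.83 in the "Average" row. The spread computed from the rounded fold values is close to, but need not exactly match, the reported ± 0.26.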
Comparison of the prediction performance between our proposed method and other state-of-the-art works on the S. cerevisiae dataset. N/A means not available.
| Method | Feature | Classifier | ACC (%) | SN (%) | PPV (%) | MCC (%) |
|---|---|---|---|---|---|---|
| Our method | HOG + SVD | RF | 94.83 ± 0.26 | 92.40 ± 0.50 | 97.10 ± 0.35 | 89.77 ± 0.50 |
| You’s work [ | MLD | RF | 94.72 ± 0.43 | 94.34 ± 0.49 | 98.91 ± 0.33 | 85.99 ± 0.89 |
| You’s work [ | AC + CT + LD + MAC | E-ELM | 87.00 ± 0.29 | 86.15 ± 0.43 | 87.59 ± 0.32 | 77.36 ± 0.44 |
| You’s work [ | MCD | SVM | 91.36 ± 0.36 | 90.67 ± 0.69 | 91.94 ± 0.62 | 84.21 ± 0.59 |
| Wong’s work [ | PR-LPQ | Rotation Forest | 93.92 ± 0.36 | 91.10 ± 0.31 | 96.45 ± 0.45 | 88.56 ± 0.63 |
| Guo’s work [ | ACC | SVM | 89.33 ± 2.67 | 89.93 ± 3.68 | 88.87 ± 6.16 | N/A |
| Guo’s work [ | AC | SVM | 87.36 ± 1.38 | 87.30 ± 4.68 | 87.82 ± 4.33 | N/A |
| Zhou’s work [ | LD | SVM | 88.56 ± 0.33 | 87.37 ± 0.22 | 89.50 ± 0.60 | 77.15 ± 0.68 |
| Yang’s work [ | LD | KNN | 86.15 ± 1.17 | 81.03 ± 1.74 | 90.24 ± 1.34 | N/A |
* The feature representations of protein-protein interactions include the Histogram of Oriented Gradient (HOG), Singular Value Decomposition (SVD), Multi-scale Local Descriptor (MLD), Auto-Correlation (AC), Conjoint Triads (CT), Local Descriptors (LD), Moran autocorrelation (MAC), Multi-scale Continuous and Discontinuous (MCD), Local Phase Quantization descriptor (PR-LPQ) and Auto Cross Covariance (ACC). The classifiers include the Random Forest (RF), Ensemble Extreme Learning Machine (E-ELM), Support Vector Machine (SVM) and K-Nearest Neighbor (KNN).
Comparison of the prediction performance between our proposed method and other methods on the H. pylori dataset. N/A means not available.
| Methods | ACC (%) | SN (%) | PPV (%) | MCC (%) |
|---|---|---|---|---|
| Our method | 89.06 | 88.15 | 89.79 | 78.15 |
| You’s work (MLD) [ | 88.30 | 92.47 | 85.99 | 79.19 |
| You’s work (AC + CT + LD + MAC) [ | 87.50 | 88.95 | 86.15 | 78.13 |
| You’s work (MCD) [ | 84.91 | 83.24 | 86.12 | 74.40 |
| Huang’s work (DCT + SMR) [ | 86.74 | 86.43 | 87.01 | 76.99 |
| Zhou’s work [ | 84.20 | 85.10 | 83.30 | N/A |
| Phylogenetic bootstrap [ | 75.80 | 69.80 | 80.20 | N/A |
| HKNN [ | 84.00 | 86.00 | 84.00 | N/A |
| Signature products [ | 83.40 | 79.90 | 85.70 | N/A |
| Ensemble of HKNN [ | 86.60 | 86.70 | 85.00 | N/A |
| Boosting | 79.52 | 80.37 | 81.69 | 70.64 |
Comparison of the prediction performance between our proposed method and other methods on the Human dataset.
| Methods | ACC (%) | SN (%) | PPV (%) | MCC (%) |
|---|---|---|---|---|
| Our method | 97.60 | 96.37 | 98.59 | 95.21 |
| Huang’s work (DCT + SMR) [ | 96.30 | 92.63 | 99.59 | 92.82 |
Prediction results on five independent species by our proposed method, based on the dataset as the training set. N/A means not available.
| Species | Testing Pairs | ACC (%), Our Method | ACC (%), You’s Work [ | ACC (%), Huang’s Work [ | ACC (%), Zhou’s Work [ |
|---|---|---|---|---|---|
| | 6954 | 93.18 | 89.30 | 66.08 | 71.24 |
| | 4013 | 90.28 | 87.71 | 81.19 | 75.73 |
| | 1412 | 94.58 | 94.19 | 82.22 | 76.27 |
| | 1420 | 92.03 | 90.99 | 82.18 | N/A |
| | 313 | 92.25 | 91.96 | 79.87 | 76.68 |
Figure 1. A crossover network for the Wnt-related pathway.
Comparison of different protein representation approaches by our method, measured by ACC (%).
| Dataset | AAC | BLOSUM62 | AAC + BLOSUM62 | PSSM | AAS | PR |
|---|---|---|---|---|---|---|
| S. cerevisiae | 94.83 ± 0.26 | 94.32 ± 0.21 | 94.34 ± 0.63 | 94.21 ± 0.57 | 94.19 ± 0.66 | 93.37 ± 0.38 |
| H. pylori | 89.06 ± 0.96 | 88.62 ± 1.13 | 89.16 ± 1.09 | 88.51 ± 1.04 | 87.59 ± 1.27 | 84.67 ± 1.29 |
| Human | 97.60 ± 0.29 | 97.56 ± 0.13 | 97.59 ± 0.16 | 97.55 ± 0.33 | 97.46 ± 0.48 | 96.56 ± 0.91 |
Figure 2. The schematic diagram for calculating Histogram of Oriented Gradient (HOG).
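Since the figure itself is not reproduced here, the following sketch illustrates the general idea behind a HOG-style descriptor on a numeric matrix such as the SMR matrix: compute gradients, convert them to magnitude and orientation, then accumulate a magnitude-weighted orientation histogram. It is a simplified, single-cell version under stated assumptions. The function name `hog_histogram`, the nine unsigned orientation bins, and the L2 normalization are illustrative choices, not details confirmed by the source, which may divide the matrix into cells and blocks.

```python
import numpy as np


def hog_histogram(matrix, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations
    over the whole matrix (one cell, no block structure)."""
    m = np.asarray(matrix, dtype=float)
    # np.gradient returns derivatives along axis 0 (rows), then axis 1 (columns).
    gy, gx = np.gradient(m)
    magnitude = np.hypot(gx, gy)
    # Fold signed angles into the unsigned range [0, 180] degrees.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(
        orientation, bins=n_bins, range=(0.0, 180.0), weights=magnitude
    )
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist


# Placeholder input standing in for an N x 20 SMR matrix (random values).
rng = np.random.default_rng(0)
demo = rng.normal(size=(30, 20))
print(hog_histogram(demo).shape)

# A matrix that only varies along its columns has purely horizontal gradients,
# so all of the weight falls into the first orientation bin.
ramp = np.tile(np.arange(5.0), (4, 1))
ramp_hist = hog_histogram(ramp)
```

A full HOG pipeline would typically compute such histograms per cell and concatenate them after block normalization; the single-cell form above is only meant to make the gradient-to-histogram step concrete.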