Chang Zhou1,2, Hua Yu1,2, Yijie Ding1,2, Fei Guo1,2, Xiu-Jun Gong1,2.
Abstract
A number of computational approaches have been developed to predict protein interactions effectively and accurately. However, most of these methods perform worse when other biological data sources (e.g., protein structure information, protein domains, or gene neighborhood information) are not available. In the present work, we propose a method for predicting protein interactions that makes full use of the physicochemical characteristics of amino acids. A protein sequence is encoded at multiple scales by seven properties of amino acids, including their qualitative and quantitative descriptions. Five kinds of protein descriptors (frequency, composition, transformation, distribution and auto covariance) are extracted from these encodings to represent each protein sequence. The resulting 347-dimensional feature representation captures not only the compositional and positional information of amino acids in the sequence but also their statistical significance. Based on this feature representation, the gradient boosting decision tree algorithm is used to predict the protein interaction class. When the proposed method is tested on the PPI data of S.cerevisiae, it achieves a prediction accuracy of 95.28% with a Matthews correlation coefficient of 90.68%. Compared with state-of-the-art works on H.pylori and Human, the accuracies are raised to 89.27% and 98.00%, respectively. Extensive experiments are also performed on a crossover protein-protein interaction network, and the prediction accuracies are very promising. Owing to the learning capabilities of the gradient boosting decision tree and the multi-scale feature representation scheme, the proposed method might be a useful tool for future proteomics studies.
Year: 2017 PMID: 28792503 PMCID: PMC5549711 DOI: 10.1371/journal.pone.0181426
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The distributions of the golden positive samples (GPS) and golden negative samples (GNS).
| Dataset | #GPS | #GNS | #Total |
|---|---|---|---|
| S.cerevisiae | 5594 | 5594 | 11188 |
| H.pylori | 1458 | 1458 | 2916 |
| Human | 3899 | 4262 | 8161 |
Five-fold cross-validation on S.cerevisiae dataset.
| Testset | ACC% | SN% | PPV% | F-score% | MCC% |
|---|---|---|---|---|---|
| 1 | 95.22 | 92.75 | 97.79 | 95.20 | 90.57 |
| 2 | 95.00 | 91.96 | 98.02 | 94.90 | 90.17 |
| 3 | 94.77 | 91.99 | 97.45 | 94.64 | 89.69 |
| 4 | 95.57 | 92.86 | 98.09 | 95.40 | 91.27 |
| 5 | |||||
| Average±Std | 95.28±0.38 | 92.75±0.81 | 97.18±0.62 | 97.70±2.22 | 90.68±0.72 |
Five-fold cross-validation on Human dataset.
| Testset | ACC% | SN% | PPV% | F-score% | MCC% |
|---|---|---|---|---|---|
| 1 | 98.16 | 97.05 | 99.08 | 98.05 | 96.33 |
| 2 | 97.18 | 95.35 | 98.83 | 97.06 | 94.41 |
| 3 | |||||
| 4 | 98.35 | 97.47 | 98.92 | 98.19 | 96.67 |
| 5 | 97.92 | 97.04 | 98.56 | 97.79 | 95.83 |
| Average±Std | 98.00±0.44 | 96.90±0.81 | 98.90±0.19 | 97.89±0.45 | 96.01±0.87 |
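The fold-wise tables report one row per test set plus an "Average±Std" summary. The sketch below shows one way such a five-fold protocol can be organised: sample indices are shuffled and dealt into five folds, each fold serves once as the test set, and per-fold scores are summarised as mean ± standard deviation. The helper names, the seed, and the choice of the sample standard deviation are assumptions; the record does not state how the authors partitioned the data or which standard deviation they used.

```python
import random
import statistics

def five_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

def summarize(scores):
    """Format per-fold scores as 'mean±std' (sample standard deviation assumed)."""
    return f"{statistics.mean(scores):.2f}±{statistics.stdev(scores):.2f}"

folds = five_fold_indices(23)  # hypothetical dataset of 23 samples
for train, test in train_test_splits(folds):
    print(len(train), len(test))

# Hypothetical per-fold accuracies, not taken from the paper.
print(summarize([90.0, 92.0, 94.0, 96.0, 98.0]))
```

Swapping `statistics.stdev` for `statistics.pstdev` gives the population standard deviation instead.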
Contribution of QLC, QNC and QLC+QNC on S.cerevisiae dataset.
| Feature | ACC% | SN% | PPV% | F-score% | MCC% |
|---|---|---|---|---|---|
| QLC | 94.63±0.30 | 91.23±0.50 | 97.89±0.13 | 94.44±0.21 | 89.46±0.54 |
| QNC | 94.68±0.40 | 92.06±0.55 | 97.16±0.41 | 94.54±0.42 | 89.48±0.81 |
| QLC+QNC | 95.28±0.38 | 92.75±0.81 | 97.18±0.62 | 97.70±2.22 | 90.68±0.72 |
Contribution of QLC, QNC and QLC+QNC on Human dataset.
| Feature | ACC% | SN% | PPV% | F-score% | MCC% |
|---|---|---|---|---|---|
| QLC | 97.74±0.45 | 96.38±0.85 | 98.84±0.21 | 97.59±0.50 | 95.48±0.90 |
| QNC | 97.87±0.13 | 96.43±0.40 | 99.08±0.18 | 97.74±0.16 | 95.75±0.25 |
| QLC+QNC | 98.00±0.44 | 96.90±0.81 | 98.90±0.19 | 97.89±0.45 | 96.01±0.87 |
Fig 1. Comparison of ACC by different classifiers.
Fig 4. Comparison of MCC by different classifiers.
Performance comparison using different classifiers on three datasets.
| Dataset | Classifier | ACC% | SN% | PPV% | F-score% | MCC% |
|---|---|---|---|---|---|---|
| S.cerevisiae | SVM | 93.25 | 91.82 | 94.16 | 92.97 | 86.51 |
| S.cerevisiae | RF | 94.61 | 91.71 | 97.34 | 94.44 | 89.37 |
| S.cerevisiae | GBDT | 95.28 | 92.75 | 97.18 | 97.70 | 90.68 |
| H.pylori | SVM | 85.94 | 85.87 | 86.00 | 85.90 | 71.91 |
| H.pylori | RF | 86.28 | 86.61 | 86.01 | 86.27 | 72.58 |
| H.pylori | GBDT | 89.27 | 91.05 | 87.98 | 89.44 | 78.60 |
| Human | SVM | 96.45 | 93.93 | 98.58 | 96.20 | 92.96 |
| Human | RF | 97.57 | 96.39 | 98.51 | 97.44 | 95.15 |
| Human | GBDT | 98.00 | 96.90 | 98.90 | 97.89 | 96.01 |
The performance of different methods on S.cerevisiae dataset.
| Method | Feature | Classifier | ACC% | SN% | PPV% | MCC% |
|---|---|---|---|---|---|---|
| Our | QLC+QNC | GBDT | 92.28 | 97.90 | ||
| Ding | HOG+SVD | RF | 94.83 | 97.10 | 89.77 | |
| You | MLD | RF | 94.72 | 94.34 | 85.99 | |
| You | AC+CT+LD+MAC | E-ELM | 87.00 | 86.15 | 87.59 | 77.36 |
| You | MCD | SVM | 91.36 | 90.67 | 91.94 | 84.21 |
| Wong | PR-LPQ | Rotation F | 93.92 | 91.10 | 96.45 | 88.56 |
| Gou | ACC | SVM | 89.33 | 89.93 | 88.87 | NA |
| Gou | AC | SVM | 87.36 | 87.30 | 87.82 | NA |
| Zhou | LD | SVM | 88.56 | 87.37 | 89.50 | 77.15 |
| Yang | LD | KNN | 86.15 | 81.03 | 90.24 | NA |
The performance of different methods on H.pylori dataset.
| Method | ACC% | SN% | PPV% | MCC% |
|---|---|---|---|---|
| Our | 91.05 | 87.98 | 78.62 | |
| Ding’s work(HOG+SVD) | 89.06 | 88.15 | 78.15 | |
| Ding’s work(MMI+NMBAC) | 87.59 | 86.81 | 88.23 | 75.24 |
| You’s work(MLD) | 88.30 | 85.99 | ||
| You’s work(AC+CT+LD+MAC) | 87.50 | 88.95 | 86.15 | 78.13 |
| You’s work(MCD) | 84.91 | 83.24 | 86.12 | 74.40 |
| Huang’s work(DCT+SMR) | 86.74 | 86.43 | 87.01 | 76.99 |
| Zhou’s work | 84.20 | 85.10 | 83.30 | NA |
Fig 5. The prediction on the Wnt-related pathway network.
Fig 6. The architecture of the proposed method.
Seven physicochemical properties for 20 amino acid types.
| Property | Group 1 | Group 2 | Group 3 |
|---|---|---|---|
| Hydrophobicity | Polar | Neutral | Hydrophobic |
| | R,K,E,D,Q,N | G,A,S,T,P,H,Y | C,L,V,I,M,F,W |
| Normalized van der Waals volume | 0-2.78 | 2.95-4.0 | 4.03-8.08 |
| | G,A,S,T,P,D | N,V,E,C,Q,I,L | M,H,K,F,R,Y,W |
| Polarity | 4.9-6.2 | 8.0-9.2 | 10.4-13.0 |
| | L,I,F,W,C,M,V,Y | P,A,T,G,S | H,Q,R,K,N,E,D |
| Polarizability | 0-0.108 | 0.128-0.186 | 0.219-0.409 |
| | G,A,S,D,T | C,P,N,V,E,Q,I,L | K,M,H,F,R,Y,W |
| Charge | Positive | Neutral | Negative |
| | K,R | A,N,C,Q,G,H,I,L,M,F,P,S,T,W,Y,V | D,E |
| Secondary structure | Helix | Strand | Coil |
| | E,A,L,M,Q,K,R,H | V,I,Y,C,W,F,T | G,N,P,S,D |
| Solvent accessibility | Buried | Exposed | Intermediate |
| | A,L,F,C,G,I,V,W | R,K,Q,E,N,D | M,S,P,T,H,Y |
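A three-group table like this is typically turned into features by recoding each residue as its group number and then computing composition, transition and distribution (CTD) descriptors. The sketch below does this for the hydrophobicity row only. It is an illustration under assumptions: the "transformation" descriptor named in the abstract is taken here to mean the usual transition descriptor, the quantile convention in the distribution step is one common choice, and the function name `ctd` is hypothetical. The authors' exact definitions are not given in this record.

```python
# Hydrophobicity groups, copied from the table above.
GROUPS = {
    1: "RKEDQN",    # polar
    2: "GASTPHY",   # neutral
    3: "CLVIMFW",   # hydrophobic
}
AA_TO_GROUP = {aa: g for g, aas in GROUPS.items() for aa in aas}

def ctd(sequence):
    """Return (composition, transition, distribution) for one 3-group property.

    Assumes a sequence of at least two standard amino acid letters;
    any other letter raises KeyError.
    """
    codes = [AA_TO_GROUP[aa] for aa in sequence]
    n = len(codes)
    # Composition: fraction of residues falling in each group.
    comp = [codes.count(g) / n for g in (1, 2, 3)]
    # Transition: fraction of adjacent pairs that switch between two given groups.
    pairs = list(zip(codes, codes[1:]))
    trans = [
        sum(1 for a, b in pairs if {a, b} == {g, h}) / (n - 1)
        for g, h in ((1, 2), (1, 3), (2, 3))
    ]
    # Distribution: relative positions of the first, 25%, 50%, 75% and last
    # occurrence of each group (0.0 when the group is absent).
    dist = []
    for g in (1, 2, 3):
        pos = [i + 1 for i, c in enumerate(codes) if c == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                dist.append(0.0)
            else:
                k = max(1, int(frac * len(pos)))
                dist.append(pos[k - 1] / n)
    return comp, trans, dist

comp, trans, dist = ctd("RKGAL")  # toy sequence, not from the paper
print(comp, trans, len(dist))
```

Under this convention each property contributes 3 + 3 + 15 = 21 values, and repeating the computation with the other rows of the table gives the descriptors for the remaining properties.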
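Six physicochemical properties for 20 amino acid types: hydrophobicity (H), volume of side chains (VSC), polarity (P1), polarizability (P2), solvent-accessible surface area (SASA) and net charge index of side chains (NCIS).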
| Amino acid | H | VSC | P1 | P2 | SASA | NCIS |
|---|---|---|---|---|---|---|
| A | 0.62 | 27.5 | 8.1 | 0.046 | 1.181 | 0.007187 |
| C | 0.29 | 44.6 | 5.5 | 0.128 | 1.461 | -0.03661 |
| D | -0.9 | 40 | 13 | 0.105 | 1.587 | -0.02382 |
| E | -0.74 | 62 | 12.3 | 0.151 | 1.862 | -0.006802 |
| F | 1.19 | 115.5 | 5.2 | 0.29 | 2.228 | 0.037552 |
| G | 0.48 | 0 | 9 | 0 | 0.881 | 0.179052 |
| H | -0.4 | 79 | 10.4 | 0.23 | 2.025 | -0.01069 |
| I | 1.38 | 93.5 | 5.2 | 0.186 | 1.81 | 0.021631 |
| K | -1.5 | 100 | 11.3 | 0.219 | 2.258 | 0.017708 |
| L | 1.06 | 93.5 | 4.9 | 0.186 | 1.931 | 0.051672 |
| M | 0.64 | 94.1 | 5.7 | 0.221 | 2.034 | 0.002683 |
| N | -0.78 | 58.7 | 11.6 | 0.134 | 1.655 | 0.005392 |
| P | 0.12 | 41.9 | 8 | 0.131 | 1.468 | 0.239531 |
| Q | -0.85 | 80.7 | 10.5 | 0.18 | 1.932 | 0.049211 |
| R | -2.53 | 105 | 10.5 | 0.291 | 2.56 | 0.043587 |
| S | -0.18 | 29.3 | 9.2 | 0.062 | 1.298 | 0.004627 |
| T | -0.05 | 51.3 | 8.6 | 0.108 | 1.525 | 0.003352 |
| V | 1.08 | 71.5 | 5.9 | 0.14 | 1.645 | 0.057004 |
| W | 0.81 | 145.5 | 5.4 | 0.409 | 2.663 | 0.037977 |
| Y | 0.26 | 117.3 | 6.2 | 0.298 | 2.368 | 0.0323599 |
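Quantitative scales such as these are commonly converted into fixed-length features with the auto covariance descriptor named in the abstract: each residue is replaced by its property value, and the covariance between positions a fixed lag apart is averaged along the sequence. The sketch below uses the H column of the table and the usual definition AC(lag) = 1/(n - lag) · Σ (x_i - mean)(x_{i+lag} - mean). The function name, the maximum lag, and the omission of any prior normalisation of the scale are assumptions, since the record does not give the authors' settings.

```python
# Hydrophobicity (H) values, copied from the table above.
HYDROPHOBICITY = {
    "A": 0.62, "C": 0.29, "D": -0.9, "E": -0.74, "F": 1.19,
    "G": 0.48, "H": -0.4, "I": 1.38, "K": -1.5, "L": 1.06,
    "M": 0.64, "N": -0.78, "P": 0.12, "Q": -0.85, "R": -2.53,
    "S": -0.18, "T": -0.05, "V": 1.08, "W": 0.81, "Y": 0.26,
}

def auto_covariance(sequence, scale, max_lag):
    """Return [AC(1), ..., AC(max_lag)] of one property along a sequence.

    Requires max_lag < len(sequence).
    """
    values = [scale[aa] for aa in sequence]
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]

# Toy sequence and lag, not taken from the paper.
print(auto_covariance("MKVLAAGIW", HYDROPHOBICITY, max_lag=3))
```

Applying the same function to each of the six columns and concatenating the results gives 6 × `max_lag` values per protein, independent of sequence length.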
Five-fold cross-validation on H.pylori dataset.
| Testset | ACC% | SN% | PPV% | F-score% | MCC% |
|---|---|---|---|---|---|
| 1 | 89.21 | 92.00 | 88.99 | 90.47 | 78.11 |
| 2 | 87.14 | 90.32 | 84.00 | 87.05 | 74.50 |
| 3 | 88.34 | 84.39 | 88.63 | 77.13 | |
| 4 | 91.67 | ||||
| 5 | 89.71 | 87.94 | 90.51 | 89.21 | 79.41 |
| Average±Std | 89.27±1.59 | 91.05±1.82 | 87.98±3.23 | 89.44±1.62 | 78.60±3.08 |
Contribution of QLC, QNC and QLC+QNC on H.pylori dataset.
| Feature | ACC% | SN% | PPV% | F-score% | MCC% |
|---|---|---|---|---|---|
| QLC | 88.10±0.74 | 88.35±1.07 | 87.92±1.90 | 88.12±0.86 | 76.23±1.45 |
| QNC | 88.17±1.18 | 90.75±1.18 | 86.33±2.32 | 88.46±1.11 | 76.47±2.28 |
| QLC+QNC | 89.27±1.59 | 91.05±1.82 | 87.98±3.23 | 89.44±1.62 | 78.60±3.08 |