Yijie Ding, Jijun Tang, Fei Guo.
Abstract
BACKGROUND: Protein-protein interactions (PPIs) are central to many biological processes. Many algorithms and methods have been developed to predict PPIs and protein interaction networks. However, the application of most existing methods is limited because they are computationally difficult and rely on a large number of homologous proteins and interaction marks of protein partners. In this paper, we propose a novel sequence-based approach that uses multivariate mutual information (MMI) as the protein feature representation and predicts PPIs with Random Forest (RF).
Keywords: Conjoint amino acids; Feature extraction; Multivariate mutual information; Protein sequence; Protein-protein interactions
Year: 2016 PMID: 27677692 PMCID: PMC5039908 DOI: 10.1186/s12859-016-1253-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Division of 20 amino acid types, based on dipoles and volumes of side chains
| | | Dipole | Volume |
|---|---|---|---|
| | | Dipole < 1.0 | Volume < 50 |
| | | 1.0 < Dipole < 2.0 (form disulphide bonds) | Volume > 50 |
| | | Dipole > 3.0 (opposite orientation) | Volume > 50 |
| | | Dipole < 1.0 | Volume > 50 |
| | | 2.0 < Dipole < 3.0 | Volume > 50 |
| | | Dipole > 3.0 | Volume > 50 |
| | | 1.0 < Dipole < 2.0 | Volume > 50 |
Fig. 1 3-gram or 2-gram feature representation
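
The 2-gram and 3-gram representation can be illustrated with a minimal sketch that maps residues to seven classes and counts contiguous class n-grams. The class membership below is an assumption: it is the seven-class grouping commonly used for conjoint-triad features, and the member lists are not present in the table as extracted here. The function names `encode` and `ngram_frequencies` are hypothetical, and treating n-grams as ordered windows is a simplification that may differ from the paper's exact definition.

```python
from collections import Counter
from itertools import product

# ASSUMPTION: seven-class grouping commonly used for conjoint-triad features.
# The member lists are not recoverable from the table above.
CLASSES = ["AGV", "C", "DE", "FILP", "HNQW", "KR", "MSTY"]
AA_TO_CLASS = {aa: idx for idx, members in enumerate(CLASSES) for aa in members}


def encode(sequence):
    """Map a protein sequence to class indices 0..6, skipping unknown residues."""
    return [AA_TO_CLASS[aa] for aa in sequence.upper() if aa in AA_TO_CLASS]


def ngram_frequencies(sequence, n):
    """Relative frequency of every ordered, contiguous n-gram of classes."""
    codes = encode(sequence)
    windows = [tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)]
    counts = Counter(windows)
    total = len(windows)
    # Enumerate all 7**n possible n-grams so the vector has a fixed length.
    return {gram: (counts[gram] / total if total else 0.0)
            for gram in product(range(len(CLASSES)), repeat=n)}


if __name__ == "__main__":
    two_grams = ngram_frequencies("MKVLAAGIC", 2)   # hypothetical toy sequence
    three_grams = ngram_frequencies("MKVLAAGIC", 3)
    print(len(two_grams), len(three_grams))
```

Enumerating every possible n-gram, including those with zero count, keeps the feature vector the same length for all proteins, which a classifier such as Random Forest requires.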
Original values of six physicochemical properties of 20 amino acid types
| Amino acid | H | VSC | P1 | P2 | SASA | NCISC |
|---|---|---|---|---|---|---|
| A | 0.62 | 27.5 | 8.1 | 0.046 | 1.181 | 0.007187 |
| C | 0.29 | 44.6 | 5.5 | 0.128 | 1.461 | -0.03661 |
| D | -0.9 | 40 | 13 | 0.105 | 1.587 | -0.02382 |
| E | -0.74 | 62 | 12.3 | 0.151 | 1.862 | 0.006802 |
| F | 1.19 | 115.5 | 5.2 | 0.29 | 2.228 | 0.037552 |
| G | 0.48 | 0 | 9 | 0 | 0.881 | 0.179052 |
| H | -0.4 | 79 | 10.4 | 0.23 | 2.025 | -0.01069 |
| I | 1.38 | 93.5 | 5.2 | 0.186 | 1.81 | 0.021631 |
| K | -1.5 | 100 | 11.3 | 0.219 | 2.258 | 0.017708 |
| L | 1.06 | 93.5 | 4.9 | 0.186 | 1.931 | 0.051672 |
| M | 0.64 | 94.1 | 5.7 | 0.221 | 2.034 | 0.002683 |
| N | -0.78 | 58.7 | 11.6 | 0.134 | 1.655 | 0.005392 |
| P | 0.12 | 41.9 | 8 | 0.131 | 1.468 | 0.239531 |
| Q | -0.85 | 80.7 | 10.5 | 0.18 | 1.932 | 0.049211 |
| R | -2.53 | 105 | 10.5 | 0.291 | 2.56 | 0.043587 |
| S | -0.18 | 29.3 | 9.2 | 0.062 | 1.298 | 0.004627 |
| T | -0.05 | 51.3 | 8.6 | 0.108 | 1.525 | 0.003352 |
| V | 1.08 | 71.5 | 5.9 | 0.14 | 1.645 | 0.057004 |
| W | 0.81 | 145.5 | 5.4 | 0.409 | 2.663 | 0.037977 |
| Y | 0.26 | 117.3 | 6.2 | 0.298 | 2.368 | 0.023599 |
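
Because the table lists original (unscaled) values, a property is typically rescaled before it is used in an autocorrelation descriptor such as NMBAC. The sketch below is hedged: it assumes z-score scaling over the 20 residues and the common Moreau-Broto form that averages products of values `lag` positions apart. Neither choice is confirmed by the text here, and the names `zscore_scale` and `moreau_broto` are hypothetical. Only the H column is used, copied from the table above.

```python
from statistics import mean, pstdev

# H column copied from the table above.
H_VALUES = {
    "A": 0.62, "C": 0.29, "D": -0.9, "E": -0.74, "F": 1.19,
    "G": 0.48, "H": -0.4, "I": 1.38, "K": -1.5, "L": 1.06,
    "M": 0.64, "N": -0.78, "P": 0.12, "Q": -0.85, "R": -2.53,
    "S": -0.18, "T": -0.05, "V": 1.08, "W": 0.81, "Y": 0.26,
}


def zscore_scale(table):
    """ASSUMPTION: scale one property to zero mean and unit variance."""
    values = list(table.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation over 20 residues
    return {aa: (value - mu) / sigma for aa, value in table.items()}


def moreau_broto(sequence, scaled, lag):
    """Average product of scaled property values that are `lag` residues apart.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    values = [scaled[aa] for aa in sequence]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(sequence)")
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


if __name__ == "__main__":
    scaled_h = zscore_scale(H_VALUES)
    # One descriptor value per lag; the toy sequence is hypothetical.
    print([round(moreau_broto("MKVLAAGIC", scaled_h, lag), 3) for lag in (1, 2, 3)])
```

A full descriptor would repeat this for each of the six properties and each lag up to a chosen maximum, giving six values per lag.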
Performance of 2-tuple and 3-tuple MI on the S. cerevisiae dataset
| Feature | Classifier | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| 2-tuples MI | RF | 93.56 ±0.23 | 89.98 ±0.51 | 97.41 ±0.64 | 97.38 ±0.58 | 90.06 ±0.45 | 93.54 ±0.41 | 87.42 ±0.83 |
| 3-tuples MI | RF | 93.88 ±0.25 | 90.25 ±0.42 | 97.30 ±0.50 | 96.94 ±0.44 | 91.35 ±0.55 | 93.47 ±0.39 | 87.92 ±0.77 |
| MMI | RF | 94.23 ±0.36 | 91.01 ±0.45 | 97.44 ±0.40 | 97.27 ±0.38 | 91.55 ±0.48 | 94.03 ±0.35 | 88.63 ±0.71 |
Fig. 2 Accuracy of our method with NMBAC on different values of lag
Performance of MMI and NMBAC on the S. cerevisiae dataset with the RF classifier
| Feature | | | | | | | |
|---|---|---|---|---|---|---|---|
| MMI | 94.23 ±0.36 | 91.01 ±0.45 | 97.44 ±0.40 | 97.27 ±0.38 | 91.55 ±0.48 | 94.03 ±0.35 | 88.63 ±0.71 |
| NMBAC | 92.76 ±0.35 | 90.99 ±0.59 | 94.53 ±0.50 | 94.34 ±0.37 | 91.30 ±0.68 | 92.63 ±0.26 | 85.57 ±0.70 |
| MMI+NMBAC(A-B order) | 95.01 ±0.46 | 92.67 ±0.50 | 97.31 ±0.61 | 97.16 ±0.55 | 93.06 ±0.48 | 94.26 ±1.18 | 90.10 ±0.92 |
| MMI+NMBAC(B-A order) | 94.90 ±0.24 | 92.60 ±0.47 | 97.22 ±0.58 | 97.10 ±0.44 | 92.89 ±0.55 | 94.79 ±0.78 | 89.91 ±1.1 |
5-fold cross-validation results obtained with our proposed method on the S. cerevisiae dataset
| Testing set | | | | | | | |
|---|---|---|---|---|---|---|---|
| 1 | 95.41 | 93.15 | 97.60 | 97.46 | 93.54 | 92.26 | 90.88 |
| 2 | 94.99 | 92.03 | 97.82 | 97.57 | 92.80 | 94.72 | 90.11 |
| 3 | 94.28 | 92.31 | 96.29 | 96.23 | 92.44 | 94.23 | 88.64 |
| 4 | 94.95 | 92.69 | 97.22 | 97.10 | 92.97 | 94.84 | 89.99 |
| 5 | 95.40 | 93.15 | 97.60 | 97.46 | 93.54 | 95.26 | 90.88 |
| Average | 95.01 ±0.46 | 92.67 ±0.50 | 97.31 ±0.61 | 97.16 ±0.55 | 93.06 ±0.48 | 94.26 ±1.18 | 90.10 ±0.92 |
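
The Average row can be reproduced from the five testing-set rows. The short check below uses the first metric column and assumes the ± term is a sample standard deviation (n − 1 denominator). That is an inference from the numbers agreeing, not something the table states. The name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# First metric column for testing sets 1-5, copied from the table above.
fold_values = [95.41, 94.99, 94.28, 94.95, 95.40]


def summarize(values):
    """Return (mean, sample standard deviation), each rounded to two decimals."""
    return round(mean(values), 2), round(stdev(values), 2)


if __name__ == "__main__":
    avg, spread = summarize(fold_values)
    print(f"{avg:.2f} ±{spread:.2f}")  # prints "95.01 ±0.46"
```

With only five folds the spread estimate is coarse, so small differences between methods that fall within one standard deviation are best read cautiously.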
Comparison of the prediction performance between our proposed method and other state-of-the-art works on S.cerevisiae dataset
| Method | Feature | Classifier | | | | |
|---|---|---|---|---|---|---|
| Our method | MMI+NMBAC | RF | 95.01 ±0.46 | 92.67 ±0.50 | 97.16 ±0.55 | 90.10 ±0.92 |
| You’s work [ | MLD | RF | 94.72 ±0.43 | 94.34 ±0.49 | 98.91 ±0.33 | 85.99 ±0.89 |
| You’s work [ | AC+CT+LD+MAC | E-ELM | 87.00 ±0.29 | 86.15 ±0.43 | 87.59 ±0.32 | 77.36 ±0.44 |
| You’s work [ | MCD | SVM | 91.36 ±0.36 | 90.67 ±0.69 | 91.94 ±0.62 | 84.21 ±0.59 |
| Wong’s work [ | PR-LPQ | Rotation Forest | 93.92 ±0.36 | 91.10 ±0.31 | 96.45 ±0.45 | 88.56 ±0.63 |
| Guo’s work [ | ACC | SVM | 89.33 ±2.67 | 89.93 ±3.68 | 88.87 ±6.16 | N/A a |
| Guo’s work [ | AC | SVM | 87.36 ±1.38 | 87.30 ±4.68 | 87.82 ±4.33 | N/A a |
| Zhou’s work [ | LD | SVM | 88.56 ±0.33 | 87.37 ±0.22 | 89.50 ±0.60 | 77.15 ±0.68 |
| Yang’s work [ | LD | KNN | 86.15 ±1.17 | 81.03 ±1.74 | 90.24 ±1.34 | N/A a |
a N/A means not available
Comparison of the prediction performance between our proposed method and other different methods on H.pylori dataset
| Methods | | | | |
|---|---|---|---|---|
| Our method(MMI + NMBAC) | 87.59 | 86.81 | 88.23 | 75.24 |
| Our method(MMI) | 85.42 | 85.22 | 87.70 | 70.71 |
| Our method(NMBAC) | 85.59 | 83.33 | 89.53 | 71.35 |
| You’s work(AC+CT+LD+MAC) [ | 87.50 | 88.95 | 86.15 | 78.13 |
| You’s work(MCD)[ | 84.91 | 83.24 | 86.12 | 74.40 |
| Huang’s work(DCT + SMR) [ | 86.74 | 86.43 | 87.01 | 76.99 |
| Phylogenetic bootstrap [ | 75.80 | 69.80 | 80.20 | N/A a |
| HKNN [ | 84.00 | 86.00 | 84.00 | N/A a |
| Signature products [ | 83.40 | 79.90 | 85.70 | N/A a |
| Ensemble of HKNN [ | 86.60 | 86.70 | 85.00 | N/A a |
| Boosting | 79.52 | 80.37 | 81.69 | 70.64 |
a N/A means not available
Comparison of the prediction performance between our proposed method and other different methods on human 8161 dataset
| Methods | | | | |
|---|---|---|---|---|
| Our method(MMI + NMBAC) | 97.56 | 96.57 | 98.30 | 95.13 |
| Our method(MMI) | 96.08 | 95.05 | 96.97 | 92.17 |
| Our method(NMBAC) | 95.59 | 94.06 | 96.94 | 91.21 |
| Huang’s work(DCT + SMR) [ | 96.30 | 92.63 | 99.59 | 92.82 |
Prediction results on five independent species by our proposed method, based on S.cerevisiae dataset as the training set
| Species | Testing pairs | ACC(%): MMI + NMBAC | ACC(%): MMI | ACC(%): NMBAC | ACC(%): You’s work [ | ACC(%): Huang’s work [ | ACC(%): Zhou’s work [ |
|---|---|---|---|---|---|---|---|
| | 6954 | 92.80 | 89.01 | 90.13 | 89.30 | 66.08 | 71.24 |
| | 4013 | 92.16 | 88.54 | 86.72 | 87.71 | 81.19 | 75.73 |
| | 1412 | 94.33 | 91.31 | 90.23 | 94.19 | 82.22 | 76.27 |
| | 1420 | 91.13 | 90.28 | 90.34 | 90.99 | 82.18 | N/A a |
| | 313 | 95.85 | 92.01 | 91.37 | 91.96 | 79.87 | 76.68 |
a N/A means not available
Comparison of prediction performance between our proposed method and other seven methods on the yeast dataset
| Method | AUROC (CV) | AUROC (C1) | AUROC (C2) | AUROC (C3) | AUPRC (CV) | AUPRC (C1) | AUPRC (C2) | AUPRC (C3) |
|---|---|---|---|---|---|---|---|---|
| MMI+NMBAC | 0.82 ±0.02 | 0.82 ±0.01 | 0.62 ±0.02 | 0.61 ±0.02 | 0.84 ±0.01 | 0.84 ±0.01 | 0.64 ±0.02 | 0.62 ±0.02 |
| MMI | 0.82 ±0.01 | 0.82 ±0.01 | 0.62 ±0.02 | 0.60 ±0.02 | 0.84 ±0.02 | 0.84 ±0.01 | 0.64 ±0.02 | 0.61 ±0.02 |
| NMBAC | 0.82 ±0.01 | 0.82 ±0.01 | 0.61 ±0.02 | 0.60 ±0.03 | 0.83 ±0.01 | 0.83 ±0.01 | 0.63 ±0.03 | 0.60 ±0.03 |
| M1 | 0.82 ±0.01 | 0.82 ±0.01 | 0.61 ±0.02 | 0.58 ±0.03 | 0.83 ±0.02 | 0.83 ±0.01 | 0.62 ±0.02 | 0.57 ±0.03 |
| M2 | 0.83 ±0.01 | 0.84 ±0.01 | 0.60 ±0.02 | 0.59 ±0.03 | 0.84 ±0.02 | 0.84 ±0.01 | 0.61 ±0.02 | 0.58 ±0.03 |
| M3 | 0.61 ±0.01 | 0.61 ±0.01 | 0.53 ±0.01 | 0.50 ±0.01 | 0.65 ±0.02 | 0.65 ±0.02 | 0.56 ±0.03 | 0.53 ±0.07 |
| M4 | 0.76 ±0.02 | 0.76 ±0.02 | 0.57 ±0.02 | 0.54 ±0.03 | 0.76 ±0.02 | 0.76 ±0.02 | 0.58 ±0.02 | 0.54 ±0.03 |
| M5 | 0.80 ±0.02 | 0.80 ±0.01 | 0.58 ±0.01 | 0.55 ±0.02 | 0.78 ±0.02 | 0.78 ±0.01 | 0.57 ±0.02 | 0.54 ±0.02 |
| M6 | 0.75 ±0.02 | 0.75 ±0.02 | 0.59 ±0.04 | 0.52 ±0.04 | 0.75 ±0.02 | 0.76 ±0.02 | 0.60 ±0.05 | 0.47 ±0.07 |
| M7 | 0.58 ±0.02 | 0.58 ±0.01 | 0.54 ±0.02 | 0.52 ±0.03 | 0.60 ±0.02 | 0.60 ±0.02 | 0.55 ±0.02 | 0.53 ±0.02 |
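
For reading the AUROC columns: AUROC equals the probability that a randomly chosen positive example (an interacting pair) receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking. The following is a minimal sketch of that pairwise definition. The function name `auroc` and the toy labels and scores are hypothetical and unrelated to the values reported above.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical classifier scores for two interacting and two non-interacting pairs.
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.3, 0.1]
    print(auroc(toy_labels, toy_scores))  # prints 0.75
```

The quadratic pairwise loop is chosen for clarity; on large test sets a rank-based computation gives the same value more efficiently.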
Comparison of prediction performance between our proposed method and other seven methods on the human dataset
| Method | AUROC (CV) | AUROC (C1) | AUROC (C2) | AUROC (C3) | AUPRC (CV) | AUPRC (C1) | AUPRC (C2) | AUPRC (C3) |
|---|---|---|---|---|---|---|---|---|
| MMI+NMBAC | 0.82 ±0.01 | 0.82 ±0.01 | 0.60 ±0.01 | 0.57 ±0.02 | 0.83 ±0.01 | 0.83 ±0.01 | 0.60 ±0.01 | 0.56 ±0.02 |
| MMI | 0.81 ±0.01 | 0.81 ±0.01 | 0.59 ±0.01 | 0.56 ±0.02 | 0.82 ±0.01 | 0.83 ±0.01 | 0.59 ±0.01 | 0.55 ±0.01 |
| NMBAC | 0.81 ±0.01 | 0.82 ±0.01 | 0.60 ±0.01 | 0.57 ±0.02 | 0.83 ±0.01 | 0.83 ±0.01 | 0.60 ±0.01 | 0.56 ±0.02 |
| M1 | 0.81 ±0.01 | 0.81 ±0.01 | 0.61 ±0.01 | 0.58 ±0.03 | 0.82 ±0.01 | 0.82 ±0.01 | 0.60 ±0.01 | 0.57 ±0.03 |
| M2 | 0.85 ±0.01 | 0.85 ±0.01 | 0.60 ±0.01 | 0.58 ±0.02 | 0.85 ±0.00 | 0.85 ±0.01 | 0.60 ±0.01 | 0.56 ±0.02 |
| M3 | 0.63 ±0.01 | 0.64 ±0.01 | 0.55 ±0.01 | 0.50 ±0.00 | 0.67 ±0.01 | 0.67 ±0.01 | 0.57 ±0.02 | 0.52 ±0.05 |
| M4 | 0.77 ±0.01 | 0.77 ±0.01 | 0.57 ±0.02 | 0.53 ±0.02 | 0.77 ±0.01 | 0.77 ±0.01 | 0.56 ±0.01 | 0.53 ±0.02 |
| M5 | 0.81 ±0.01 | 0.81 ±0.01 | 0.59 ±0.01 | 0.54 ±0.02 | 0.82 ±0.01 | 0.82 ±0.01 | 0.59 ±0.01 | 0.54 ±0.02 |
| M6 | 0.76 ±0.01 | 0.77 ±0.01 | 0.64 ±0.01 | 0.59 ±0.02 | 0.79 ±0.01 | 0.79 ±0.01 | 0.67 ±0.01 | 0.59 ±0.02 |
| M7 | 0.56 ±0.01 | 0.56 ±0.01 | 0.53 ±0.01 | 0.54 ±0.02 | 0.56 ±0.01 | 0.56 ±0.01 | 0.53 ±0.01 | 0.54 ±0.02 |
Comparison of prediction performance between our proposed method and other seven methods on new yeast dataset, suppressing representation bias-driven learning
| Method | AUROC (CV) | AUROC (C1) | AUROC (C2) | AUROC (C3) | AUPRC (CV) | AUPRC (C1) | AUPRC (C2) | AUPRC (C3) |
|---|---|---|---|---|---|---|---|---|
| MMI+NMBAC | 0.65 ±0.02 | 0.66 ±0.02 | 0.60 ±0.02 | 0.55 ±0.02 | 0.67 ±0.02 | 0.68 ±0.02 | 0.60 ±0.02 | 0.55 ±0.02 |
| MMI | 0.64 ±0.02 | 0.65 ±0.01 | 0.60 ±0.02 | 0.55 ±0.02 | 0.66 ±0.02 | 0.68 ±0.01 | 0.60 ±0.02 | 0.54 ±0.02 |
| NMBAC | 0.63 ±0.02 | 0.64 ±0.02 | 0.59 ±0.02 | 0.54 ±0.03 | 0.65 ±0.02 | 0.66 ±0.02 | 0.59 ±0.02 | 0.54 ±0.02 |
| M1 | 0.64 ±0.01 | 0.64 ±0.01 | 0.62 ±0.02 | 0.57 ±0.04 | 0.65 ±0.01 | 0.65 ±0.01 | 0.61 ±0.02 | 0.56 ±0.03 |
| M2 | 0.61 ±0.01 | 0.61 ±0.02 | 0.62 ±0.02 | 0.58 ±0.03 | 0.61 ±0.01 | 0.61 ±0.02 | 0.62 ±0.02 | 0.57 ±0.03 |
| M3 | 0.54 ±0.01 | 0.55 ±0.01 | 0.53 ±0.01 | 0.50 ±0.01 | 0.60 ±0.02 | 0.60 ±0.01 | 0.56 ±0.03 | 0.53 ±0.07 |
| M4 | 0.55 ±0.02 | 0.55 ±0.02 | 0.54 ±0.02 | 0.51 ±0.02 | 0.53 ±0.02 | 0.53 ±0.01 | 0.53 ±0.02 | 0.51 ±0.02 |
| M5 | 0.60 ±0.02 | 0.60 ±0.01 | 0.55 ±0.02 | 0.52 ±0.02 | 0.61 ±0.02 | 0.61 ±0.01 | 0.55 ±0.02 | 0.51 ±0.02 |
| M7 | 0.55 ±0.02 | 0.54 ±0.01 | 0.54 ±0.02 | 0.53 ±0.03 | 0.55 ±0.02 | 0.55 ±0.01 | 0.54 ±0.02 | 0.53 ±0.02 |
Comparison of prediction performance between our proposed method and other seven methods on new human dataset, suppressing representation bias-driven learning
| Method | AUROC (CV) | AUROC (C1) | AUROC (C2) | AUROC (C3) | AUPRC (CV) | AUPRC (C1) | AUPRC (C2) | AUPRC (C3) |
|---|---|---|---|---|---|---|---|---|
| MMI+NMBAC | 0.61 ±0.01 | 0.62 ±0.01 | 0.57 ±0.02 | 0.53 ±0.01 | 0.64 ±0.01 | 0.65 ±0.01 | 0.58 ±0.02 | 0.53 ±0.01 |
| MMI | 0.61 ±0.01 | 0.62 ±0.01 | 0.57 ±0.01 | 0.53 ±0.01 | 0.64 ±0.01 | 0.65 ±0.01 | 0.58 ±0.01 | 0.53 ±0.01 |
| NMBAC | 0.59 ±0.01 | 0.60 ±0.01 | 0.56 ±0.01 | 0.52 ±0.02 | 0.62 ±0.01 | 0.63 ±0.01 | 0.56 ±0.01 | 0.52 ±0.01 |
| M1 | 0.64 ±0.01 | 0.65 ±0.01 | 0.61 ±0.01 | 0.57 ±0.02 | 0.66 ±0.01 | 0.67 ±0.01 | 0.61 ±0.02 | 0.56 ±0.02 |
| M2 | 0.59 ±0.01 | 0.60 ±0.01 | 0.60 ±0.01 | 0.57 ±0.02 | 0.60 ±0.01 | 0.61 ±0.01 | 0.59 ±0.01 | 0.55 ±0.01 |
| M3 | 0.54 ±0.01 | 0.55 ±0.01 | 0.53 ±0.01 | 0.50 ±0.00 | 0.61 ±0.01 | 0.61 ±0.01 | 0.56 ±0.02 | 0.52 ±0.05 |
| M4 | 0.56 ±0.01 | 0.56 ±0.01 | 0.54 ±0.01 | 0.52 ±0.02 | 0.54 ±0.01 | 0.54 ±0.01 | 0.53 ±0.01 | 0.52 ±0.01 |
| M5 | 0.59 ±0.01 | 0.60 ±0.01 | 0.56 ±0.01 | 0.53 ±0.01 | 0.63 ±0.01 | 0.64 ±0.01 | 0.57 ±0.01 | 0.53 ±0.01 |
| M7 | 0.55 ±0.01 | 0.55 ±0.01 | 0.53 ±0.01 | 0.53 ±0.03 | 0.55 ±0.01 | 0.55 ±0.01 | 0.53 ±0.01 | 0.54 ±0.02 |
Fig. 3 A one-core network for the CD9 network
Fig. 4 A multiple-cores network for the Ras-Raf-Mek-Erk-Elk-Srf pathway
Fig. 5 A crossover network for the Wnt-related pathway