Julian Zubek, Marcin Tatjewski, Adam Boniecki, Maciej Mnich, Subhadip Basu, Dariusz Plewczynski.
Abstract
Accurate identification of protein-protein interactions (PPI) is the key step in understanding proteins' biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences; however, only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for the prediction of protein-protein interactions. We start with carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB). First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Second, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein-protein interaction is made using a 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level, prediction. The level-II predictor improves the results further through a more complex learning paradigm. We perform a 30-fold macro-scale, i.e., protein-level, cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods yield results below the 0.6 threshold (recall, precision, AUC). Our results demonstrate that the multi-scale sequence feature aggregation procedure improves machine learning results by more than 10% compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent).
Keywords: Interaction patches; Local sequence-structure segments; Machine learning; Multi-scale models; Physico-chemical indices; Protein interaction networks; Protein sequence; Protein-protein interactions; Sequence segments
Year: 2015 PMID: 26157620 PMCID: PMC4493684 DOI: 10.7717/peerj.1041
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Schematic depiction of our two-stage ensemble method.
Figure 2. Schema of train-test split for evaluation of trained classifiers.
Numbers of examples used in each step are given in parentheses.
Figure 3. The level-I prediction matrices for two protein pairs.
White colour corresponds to score 0.0, black colour corresponds to score 1.0.
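The all-against-all matrix of segment scores described in the abstract and shown in Figure 3 can be illustrated with a minimal sketch. It assumes overlapping windows taken with a step of one residue and uses the window size of 21 quoted in the table notes; the function names `windows`, `level1_matrix`, and `toy_score` are hypothetical, and `toy_score` is only a placeholder for the trained level-I classifier, not the authors' model.

```python
from typing import Callable, List

WINDOW = 21  # extraction window size quoted in the table notes


def windows(seq: str, size: int = WINDOW) -> List[str]:
    """Return every contiguous fragment of length `size` (step of 1 is an assumption)."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def level1_matrix(
    seq_a: str,
    seq_b: str,
    score_pair: Callable[[str, str], float],
    size: int = WINDOW,
) -> List[List[float]]:
    """Score every fragment of seq_a against every fragment of seq_b.

    Rows index fragments of seq_a, columns index fragments of seq_b.
    Each entry is a score in [0, 1], as in the grey-scale matrices of Figure 3.
    """
    frags_a = windows(seq_a, size)
    frags_b = windows(seq_b, size)
    return [[score_pair(fa, fb) for fb in frags_b] for fa in frags_a]


def toy_score(frag_a: str, frag_b: str) -> float:
    """Placeholder scorer (NOT the trained classifier): share of identical aligned residues."""
    matches = sum(1 for x, y in zip(frag_a, frag_b) if x == y)
    return matches / len(frag_a)
```

For two sequences of lengths 25 and 23, the sketch yields a 5 × 3 matrix, one entry per fragment pair. In the pipeline described above, the scorer would be the level-I classifier applied to the features of each fragment pair.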
ROC AUC scores of level-I predictor trained on different sets of features.
The interaction threshold was set to 15 and the extraction window size to 21. A+B denotes a feature vector constructed by concatenating two sets of features. RF: Random Forest, 300 trees, maximum tree depth 15; SVM: Support Vector Machine, RBF kernel, C = 2, γ = 0.048. For level-II, a Random Forest with 300 trees and maximum tree depth 7 was used. Main scores were calculated for PSIPRED-predicted secondary structure; values in parentheses are scores for DSSP secondary structure.
| Classifier | Features | Lvl-I AUC | Lvl-II AUC |
|---|---|---|---|
| RF | Raw sequence | 0.64 | 0.59 |
| RF | HQI8 | 0.70 | 0.59 |
| RF | PSIPRED structure | 0.67 | 0.63 |
| RF | PSIPRED structure + Sequence | 0.69 | 0.60 |
| RF | PSIPRED structure + HQI8 | 0.72 | 0.56 |
| RF | DSSP structure | 0.72 (0.87) | 0.70 |
| RF | DSSP structure + Sequence | 0.73 (0.87) | 0.65 |
| RF | DSSP structure + HQI8 | 0.74 (0.85) | 0.64 |
| SVM | DSSP structure | 0.59 (0.84) | 0.57 |
ROC AUC scores of level-I predictor for different interaction thresholds.
DSSP-extracted secondary structure was used for constructing the feature vector. The extraction window size was set to 21. For both level-I and level-II, a Random Forest with 300 trees and maximum tree depth 7 was used. Main scores were calculated for PSIPRED-predicted secondary structure; values in parentheses are scores for DSSP secondary structure.
| Threshold | Lvl-I AUC | Lvl-II AUC |
|---|---|---|
| 0 | 0.67 (0.84) | 0.67 |
| 5 | 0.67 (0.85) | 0.67 |
| 10 | 0.69 (0.86) | 0.68 |
| 15 | 0.72 (0.87) | 0.70 |
| 20 | 0.75 (0.88) | 0.64 |
Figure 4. ROC AUC scores of level-I predictor trained on secondary structure for different extraction window sizes.
Random Forest was used as the classifier.
Performance scores.
t = x denotes an interaction threshold of x interacting residues. The level-II predictor used secondary structure predicted by PSIPRED. RF: Random Forest, 300 trees, maximum tree depth 7; SVM: Support Vector Machine, RBF kernel, C = 1, γ = 2.
| Clf | Features | Accuracy | Precision | Recall | AUC |
|---|---|---|---|---|---|
| SVM | Lvl-II pred ( | 0.55 | 0.58 | 0.55 | 0.57 |
| SVM | AAC | 0.54 | 0.56 | 0.66 | 0.54 |
| SVM | PseAAC | 0.54 | 0.55 | 0.61 | 0.55 |
| SVM | 2grams | 0.55 | 0.56 | 0.64 | 0.55 |
| SVM | QRC | 0.51 | 0.53 | 0.59 | 0.53 |
| SVM | Liu’s dev (HQI8) | 0.55 | 0.57 | 0.60 | 0.56 |
| SVM | Liu’s dev (original) | 0.55 | 0.57 | 0.60 | 0.56 |
| RF | AAC | 0.54 | 0.57 | 0.54 | 0.56 |
| RF | PseAAC | 0.53 | 0.55 | 0.52 | 0.55 |
| RF | 2grams | 0.53 | 0.56 | 0.49 | 0.55 |
| RF | QRC | 0.50 | 0.52 | 0.43 | 0.51 |
| RF | Liu’s dev (HQI8) | 0.55 | 0.58 | 0.55 | 0.60 |
| RF | Liu’s dev (original) | 0.56 | 0.59 | 0.57 | 0.59 |
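The four performance scores reported in the table follow standard definitions, which the sketch below spells out in plain Python. It assumes binary labels (1 = interacting, 0 = non-interacting) and computes ROC AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The function names are illustrative and this is not the authors' evaluation code.

```python
from typing import Sequence, Tuple


def confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, _, _ = confusion(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, _, fn, _ = confusion(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) form of ROC AUC; ties contribute 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy, precision, and recall depend on a decision cut-off applied to the classifier scores, while AUC is computed from the raw scores and is independent of that cut-off.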
Figure 5. Relative importances of individual features in level-I predictor feature vector.
Secondary structure patterns for interacting and non-interacting fragments.
a denotes the structural motif of the i-th residue of one fragment, b denotes the structural motif of the i-th residue of the other fragment.
| Pattern | Interacting | Non-interacting |
|---|---|---|
| | 0.13 | 0.09 |
| | 0.04 | 0.05 |
| | 0.16 | 0.17 |
| | 0.06 | 0.03 |
| | 0.08 | 0.09 |
| | 0.23 | 0.30 |
| | 0.17 | 0.08 |
| | 0.03 | 0.04 |
| | 0.13 | 0.17 |
| | 0.14 | 0.03 |
| | 0.09 | 0.09 |
| | 0.18 | 0.31 |
| | 0.15 | 0.08 |
| | 0.06 | 0.05 |
| | 0.18 | 0.18 |
| | 0.05 | 0.03 |
| | 0.08 | 0.10 |
| | 0.21 | 0.31 |