Wajid Arshad Abbasi, Amina Asif, Asa Ben-Hur, Fayyaz Ul Amir Afsar Minhas.
Abstract
BACKGROUND: Determining protein-protein interactions and their binding affinity is important for understanding cellular biological processes, discovering and designing novel therapeutics, protein engineering, and mutagenesis studies. Because wet-lab experiments require considerable time and effort, computational prediction of binding affinity from sequence or structure is an important area of research. Structure-based methods, though more accurate than sequence-based techniques, are limited in applicability by the scarcity of protein structure data.
Keywords: Machine learning; Privileged information; Protein binding affinity prediction; Protein-protein interactions
Year: 2018 PMID: 30442086 PMCID: PMC6238365 DOI: 10.1186/s12859-018-2448-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. A framework to classify protein complexes based on their binding affinities with the paradigm of learning using privileged information (LUPI). Privileged information (3D structural information) is required only at training time (left panel) to help improve performance at test time using sequence information alone (right panel).
Protein complex classification results obtained using classical SVM, Random Forest and XGBoost using input and privileged features with LOCO cross-validation over the affinity benchmark dataset
| Features | Classical SVM: ROC | Classical SVM: PR | Classical SVM: Sr | Random forest: ROC | Random forest: PR | Random forest: Sr | XGBoost: ROC | XGBoost: PR | XGBoost: Sr |
|---|---|---|---|---|---|---|---|---|---|
| Input space | | | | | | | | | |
| 2-mer | | | | 0.68 | 0.63 | −0.38 | | | |
| Blosum (Protein) | 0.70 | 0.63 | −0.36 | | | | 0.69 | 0.63 | −0.34 |
| Privileged space | | | | | | | | | |
| NIRP | | | | | | | | | |
| Moal descriptors | 0.73 | 0.68 | −0.43 | 0.70 | 0.68 | −0.37 | 0.71 | 0.68 | −0.34 |
| Dias descriptors | 0.72 | 0.69 | −0.42 | 0.69 | 0.69 | −0.37 | 0.71 | 0.67 | −0.34 |
| Blosum (Interface) | 0.61 | 0.60 | −0.19 | 0.56 | 0.54 | −0.11 | 0.66 | 0.59 | −0.25 |
Boldfaced values indicate the best performance for each model. Blosum (Protein) refers to Blosum substitution scores averaged over the protein, while Blosum (Interface) refers to Blosum substitution scores averaged over the interface. Moal descriptors are taken from Moal et al. [8], and Dias descriptors are taken from Dias and Kolaczkowski [11].
ROC: area under the ROC curve; PR: area under the precision-recall curve; Sr: Spearman correlation coefficient.
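The three metrics named in the table footnote can be sketched in plain Python as follows. This is only an illustration under stated assumptions: labels are taken to be 0/1, a higher score is taken to mean "more likely positive", and the area under the precision-recall curve is approximated by the step-wise average precision. The function names are hypothetical, and the source does not state which class the authors treat as positive or which estimator they use.

```python
from itertools import product


def roc_auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.
    Assumption: tied scores are not grouped."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive
    return total / hits


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5
```

For instance, `roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])` compares four positive-negative pairs, three of which are ordered correctly.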
Protein complex classification results obtained through classical SVM and LUPI across different features using LOCO cross-validation over the affinity benchmark dataset
| Method: privileged features | 2-mer input: ROC | 2-mer input: PR | 2-mer input: Sr | Blosum (Protein) input: ROC | Blosum (Protein) input: PR | Blosum (Protein) input: Sr |
|---|---|---|---|---|---|---|
| Classical SVM | | | | 0.70 | 0.63 | −0.36 |
| LUPI-SVM: NIRP | 0.76 | 0.71 | −0.47 | 0.74 | 0.70 | −0.42 |
| LUPI-SVM: Moal descriptors | | | | | | |
| LUPI-SVM: Dias descriptors | 0.74 | 0.70 | −0.45 | 0.73 | 0.69 | −0.40 |
| LUPI-SVM: Blosum (Interface) | 0.73 | 0.69 | −0.41 | 0.73 | 0.69 | −0.42 |
Boldfaced values indicate the best performance for each model.
ROC: area under the ROC curve; PR: area under the precision-recall curve; Sr: Spearman correlation coefficient.
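The LOCO cross-validation used for these results can be sketched as below, assuming LOCO stands for leave-one-complex-out, so that every example belonging to one complex is held out together while the model is trained on the rest. The function name and the complex identifiers are hypothetical placeholders, not taken from the benchmark.

```python
def loco_splits(complex_ids):
    """Yield (held_out_id, train_indices, test_indices), holding out
    all examples of one complex at a time."""
    for held_out in sorted(set(complex_ids)):
        test = [i for i, c in enumerate(complex_ids) if c == held_out]
        train = [i for i, c in enumerate(complex_ids) if c != held_out]
        yield held_out, train, test


# Hypothetical identifiers, one entry per example in the dataset.
example_ids = ["cplx_A", "cplx_B", "cplx_C"]
folds = list(loco_splits(example_ids))
```

When each complex contributes exactly one example, this reduces to ordinary leave-one-out cross-validation.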
Fig. 2. ROC curves showing a performance comparison between LUPI-SVM (with 2-mers as input-space features and Moal descriptors as the privileged features) and the baseline classifiers (XGBoost, classical SVM (SVM), and Random Forest (RF) with 2-mer features). The average area under the ROC curve (AUC) is shown in parentheses.
Comparison of classical SVM and LUPI-SVM on the external independent validation dataset with training on affinity benchmark dataset
| Method: privileged features | 2-mer input: ROC | 2-mer input: PR | 2-mer input: Sr | Blosum (Protein) input: ROC | Blosum (Protein) input: PR | Blosum (Protein) input: Sr |
|---|---|---|---|---|---|---|
| Classical SVM | 0.63 | 0.38 | −0.28 | 0.61 | 0.39 | −0.19 |
| LUPI-SVM: NIRP | 0.66 | 0.42 | −0.30 | 0.64 | 0.40 | −0.28 |
| LUPI-SVM: Moal descriptors | | | | | | |
| LUPI-SVM: Dias descriptors | 0.65 | 0.41 | −0.29 | 0.64 | 0.44 | −0.20 |
| LUPI-SVM: Blosum (Interface) | 0.64 | 0.40 | −0.26 | 0.64 | 0.46 | −0.22 |
Boldfaced values indicate the best performance for each model.
ROC: area under the ROC curve; PR: area under the precision-recall curve; Sr: Spearman correlation coefficient.
Fig. 3. Feature analysis using SHAP. The impact of 2-mer features on model output is shown using SHAP values. The plot shows the top 20 2-mers for the ligand (L) or receptor (R), ranked by the sum of their SHAP values over all samples. Feature value is shown in color (red: high; blue: low). For example, a high value of L (EK) (the count of 'EK' in the protein sequence designated as the ligand) contributes more toward predicting low binding affinity complexes.
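A 2-mer feature such as L (EK) is a count of a two-residue substring in a sequence. The sketch below shows one way to build such counts for a ligand-receptor pair, assuming overlapping windows over the 20 standard amino acids and no normalization; the source does not say whether the authors normalize the counts. The function names and toy sequences are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered two-residue combinations.
TWO_MERS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def two_mer_counts(sequence):
    """Count overlapping 2-mers; windows with non-standard letters are skipped."""
    counts = dict.fromkeys(TWO_MERS, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return counts


def complex_features(ligand_seq, receptor_seq):
    """Concatenate ligand and receptor counts, keyed like 'L (EK)' and 'R (EK)'."""
    features = {f"L ({k})": v for k, v in two_mer_counts(ligand_seq).items()}
    features.update({f"R ({k})": v for k, v in two_mer_counts(receptor_seq).items()})
    return features


# Toy sequences for illustration only.
features = complex_features("MEKEKA", "AEKW")
```

Here `features["L (EK)"]` counts the two overlapping occurrences of 'EK' in the toy ligand sequence.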
Fig. 4. Weight vectors of the trained classifiers for the ligand Blosum features. (a) SVM with the LUPI framework, using Blosum substitution features computed over each protein as input and Moal descriptors as privileged features; (b) classical SVM using Blosum features.
Fig. 5. Training algorithm for LUPI-SVM with stochastic sub-gradient optimization.
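For orientation, the standard SVM+ formulation from the LUPI literature is written out below. It is offered as background only: the exact objective the authors optimize with stochastic sub-gradient steps is not reproduced in this text and may differ. The symbols are assumptions introduced here: $x_i$ is the input-space feature vector of example $i$, $x_i^{*}$ its privileged feature vector, $y_i \in \{-1, +1\}$ its label, and $C$ and $\gamma$ are hyperparameters.

```latex
\begin{aligned}
\min_{w,\,b,\,w^{*},\,b^{*}} \quad
  & \tfrac{1}{2}\lVert w \rVert^{2}
    + \tfrac{\gamma}{2}\lVert w^{*} \rVert^{2}
    + C \sum_{i=1}^{N} \bigl( \langle w^{*}, x_i^{*} \rangle + b^{*} \bigr) \\
\text{s.t.} \quad
  & y_i \bigl( \langle w, x_i \rangle + b \bigr)
    \ge 1 - \bigl( \langle w^{*}, x_i^{*} \rangle + b^{*} \bigr), \\
  & \langle w^{*}, x_i^{*} \rangle + b^{*} \ge 0,
    \qquad i = 1, \dots, N .
\end{aligned}
```

In this formulation the privileged space models the slack of each training example, and prediction uses only $\operatorname{sign}(\langle w, x \rangle + b)$, so privileged features are needed at training time only.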
Fig. 6. Number of interacting residue pairs (NIRP) in the interface of a protein complex. The frequency of non-repeating residue pairs (treating A:B and B:A as the same pair) was computed from the bound 3D structures of the ligand (L) and receptor (R) of a protein complex. Residues (shown as spheres) within a distance cutoff of 8 Å are considered the interface of the complex. The bottom panel of the figure shows the form of the feature vector extracted using this scheme.
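The NIRP scheme described in the caption can be sketched as follows. Several details are assumptions made for illustration: each residue is represented by a single 3D coordinate (the caption does not say which atoms are used for the distance), the cutoff is applied as "less than or equal to 8 Å", and the feature vector covers the 210 unordered pairs of the 20 standard amino acids. The function name and the toy coordinates are hypothetical.

```python
from itertools import combinations_with_replacement
from math import dist

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical order
# Unordered residue-type pairs, so 'EK' covers both E:K and K:E.
PAIR_TYPES = [a + b for a, b in combinations_with_replacement(AMINO_ACIDS, 2)]


def nirp_features(ligand, receptor, cutoff=8.0):
    """Count ligand-receptor residue pairs within `cutoff` angstroms.

    `ligand` and `receptor` are lists of (one_letter_code, (x, y, z)).
    """
    counts = dict.fromkeys(PAIR_TYPES, 0)
    for res_l, xyz_l in ligand:
        for res_r, xyz_r in receptor:
            if dist(xyz_l, xyz_r) <= cutoff:
                key = "".join(sorted(res_l + res_r))  # order-independent key
                if key in counts:  # skip non-standard residues
                    counts[key] += 1
    return [counts[p] for p in PAIR_TYPES]


# Toy coordinates for illustration only.
toy_ligand = [("E", (0.0, 0.0, 0.0)), ("K", (20.0, 0.0, 0.0))]
toy_receptor = [("K", (3.0, 0.0, 0.0)), ("E", (22.0, 0.0, 0.0)), ("A", (50.0, 0.0, 0.0))]
vector = nirp_features(toy_ligand, toy_receptor)
```

In the toy example, the E:K contact and the K:E contact both fall into the same 'EK' entry, which is the symmetry the caption describes.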