Nan Zhao, Bin Pang, Chi-Ren Shyu, Dmitry Korkin.
Abstract
Interactions between proteins play a key role in many cellular processes. Studying protein-protein interactions that share similar interaction interfaces may shed light on their evolution and could be helpful in elucidating the mechanisms behind stability and dynamics of the protein complexes. When two complexes share structurally similar subunits, the similarity of the interaction interfaces can be found through a structural superposition of the subunits. However, an accurate detection of similarity between the protein complexes containing subunits of unrelated structure remains an open problem. Here, we present an alignment-free machine learning approach to measure interface similarity. The approach relies on the feature-based representation of protein interfaces and does not depend on the superposition of the interacting subunit pairs. Specifically, we develop an SVM classifier of similar and dissimilar interfaces and derive a feature-based interface similarity measure. Next, the similarity measure is applied to a set of 2,806×2,806 binary complex pairs to build a hierarchical classification of protein-protein interactions. Finally, we explore case studies of similar interfaces from each level of the hierarchy, considering cases when the subunits forming interactions are either homologous or structurally unrelated. The analysis has suggested that the positions of charged residues in the homologous interfaces are not necessarily conserved and may exhibit more complex conservation patterns.
Year: 2011 PMID: 21589874 PMCID: PMC3093400 DOI: 10.1371/journal.pone.0019554
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. A protocol for obtaining a reliable set of similar and dissimilar interface pairs.
First, two structure-based similarity measures, iiRMSD and siRMSD, are evaluated on a dataset collected from the 3D Complex database. Second, a non-redundant domain-domain interaction dataset is obtained from PDB, SCOP, and CATH. Third, iiRMSD is used to classify positive (similar) and negative (dissimilar) training sets of pairs of interaction interface structures.
Positive and negative datasets.
| Dataset | Subsets | N_IP | Total | Threshold |
| Positive set | | 372 | 852 | iiRMSD < 8 Å |
| | | 480 | | |
| Negative set | | 723 | 1322 | 15 Å < iiRMSD < 25 Å |
| | | 599 | | |
N_IP is the number of interface pairs in each subset of the positive and negative datasets after the RMSD thresholds are applied. Total is the number of pairs in each dataset. iiRMSD is used to define an upper threshold for the positive set (8 Å) as well as the lower and upper thresholds for the negative set (15 Å and 25 Å). The thresholds are imposed to minimize the number of false positives and negatives.
Figure 2. An overview of the machine learning approach to determine the interface similarity measure.
First, interface structures are extracted from the training sets of similar and dissimilar interaction interfaces. Second, for each pair of interfaces, a 106-dimensional feature vector is calculated. Third, a Support Vector Machine (SVM) classifier is trained and evaluated using the above datasets. Last, a protein interface similarity measure δ(I1, I2) is defined for two interfaces, I1 and I2, as the distance between the corresponding 106-dimensional feature vector and the separating hyperplane.
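The last step of this overview, a similarity measure defined as a distance to the separating hyperplane, can be illustrated with a minimal sketch. It assumes a linear decision function w·x + b, which is a simplification: the models evaluated in this record use RBF and polynomial kernels, where the distance is measured in the kernel-induced feature space. The function name `signed_distance` and the toy weights are hypothetical and not taken from the paper.

```python
import math

def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0.

    Assumption: a linear decision function. Positive values fall on the
    side the normal vector w points to, negative values on the other side.
    """
    if len(w) != len(x):
        raise ValueError("w and x must have the same dimension")
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("w must be a non-zero vector")
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Toy 2-D example; the feature vectors described above are 106-dimensional.
w, b = [3.0, 4.0], -5.0
print(signed_distance(w, b, [3.0, 4.0]))  # (9 + 16 - 5) / 5 = 4.0
print(signed_distance(w, b, [0.0, 0.0]))  # -5 / 5 = -1.0
```

The sign indicates the predicted class and the magnitude indicates how far the pair of interfaces lies from the decision boundary, which is what allows a classifier output to be reused as a graded similarity score.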
Amino acid residue classes according to their physicochemical properties.
| Residue | Aliphatic | Aromatic | Positive | Negative | Small | Hydrophobic | Polar |
| ALA | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| ARG | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| ASN | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| ASP | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| CYS | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| GLU | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| GLN | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| GLY | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| HIS | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| ILE | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| LEU | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| LYS | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| MET | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| PHE | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| PRO | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| SER | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| THR | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| TRP | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| TYR | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| VAL | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
Seven classes of residues were defined, where a residue may belong to more than one class.
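The class table above can be encoded directly and used to tally residue contacts by class pair. The sketch below is only an illustration: the membership rows are copied from the table, while the function names and the counting rule (each residue contact adds one to every distinct class pair it realizes) are assumptions, since this record does not state how the contact features were computed or normalized. With the seven class columns of the table there are 28 unordered class pairs.

```python
from itertools import combinations_with_replacement

CLASSES = ["Aliphatic", "Aromatic", "Positive", "Negative",
           "Small", "Hydrophobic", "Polar"]

# Membership flags copied from the table, in the column order of CLASSES.
MEMBERSHIP = {
    "ALA": (0, 0, 0, 0, 1, 1, 0), "ARG": (0, 0, 1, 0, 0, 0, 1),
    "ASN": (0, 0, 0, 0, 1, 0, 1), "ASP": (0, 0, 0, 1, 1, 0, 1),
    "CYS": (0, 0, 0, 0, 1, 1, 0), "GLU": (0, 0, 0, 1, 0, 0, 1),
    "GLN": (0, 0, 0, 0, 0, 0, 1), "GLY": (0, 0, 0, 0, 1, 1, 0),
    "HIS": (0, 1, 1, 0, 0, 1, 1), "ILE": (1, 0, 0, 0, 0, 1, 0),
    "LEU": (1, 0, 0, 0, 0, 1, 0), "LYS": (0, 0, 1, 0, 0, 1, 1),
    "MET": (0, 0, 0, 0, 0, 1, 0), "PHE": (0, 1, 0, 0, 0, 1, 0),
    "PRO": (0, 0, 0, 0, 1, 0, 0), "SER": (0, 0, 0, 0, 1, 0, 1),
    "THR": (0, 0, 0, 0, 1, 1, 1), "TRP": (0, 1, 0, 0, 0, 1, 1),
    "TYR": (0, 1, 0, 0, 0, 1, 1), "VAL": (1, 0, 0, 0, 1, 1, 0),
}

# All unordered class pairs, e.g. ("Aromatic", "Hydrophobic").
CLASS_PAIRS = list(combinations_with_replacement(CLASSES, 2))

def classes_of(residue):
    """Return the classes a three-letter residue code belongs to."""
    flags = MEMBERSHIP[residue.upper()]
    return [name for name, flag in zip(CLASSES, flags) if flag]

def count_class_contacts(contacts):
    """Tally class-pair contacts for an iterable of (res_a, res_b) pairs.

    Assumed rule (hypothetical): one residue contact contributes 1 to
    every distinct class pair formed by the classes of its two residues.
    """
    order = {name: i for i, name in enumerate(CLASSES)}
    counts = {pair: 0 for pair in CLASS_PAIRS}
    for res_a, res_b in contacts:
        realized = set()
        for class_a in classes_of(res_a):
            for class_b in classes_of(res_b):
                realized.add(tuple(sorted((class_a, class_b), key=order.get)))
        for pair in realized:
            counts[pair] += 1
    return counts

counts = count_class_contacts([("ARG", "ASP"), ("LEU", "ILE")])
print(len(CLASS_PAIRS))                     # 28
print(counts[("Positive", "Negative")])     # 1
print(counts[("Aliphatic", "Aliphatic")])   # 1
```

Because a residue can belong to several classes, a single contact such as ARG-ASP increments several pair counters at once (Positive-Negative, Positive-Polar, Polar-Polar, and so on).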
Figure 3. Hierarchical classification of interaction interfaces.
Similar shapes correspond to homologous proteins. Three levels of structurally similar interaction interfaces are defined. A single cluster at the H-level, C-level, and A-level can include homologous, common partner analogous, and analogous interfaces, respectively.
Figure 4. Histograms of the distributions of (A) iiRMSD and (B) siRMSD values on the datasets of similar and dissimilar interfaces.
Both datasets are obtained from the 3D Complex database. On average, the dissimilar interface pairs had larger iiRMSD and siRMSD values (mean values are 20.6 and 15.8, respectively) than similar pairs (mean values are 14.8 and 14.7). In addition, the difference in mean values between the similar and dissimilar interfaces was larger when using the iiRMSD measure (Δμ is 4.7 for iiRMSD and 1.1 for siRMSD).
Figure 5. Distribution of SCOP class ID pairs from the training dataset of protein-protein interactions.
The dataset covers all SCOP class IDs, while the uneven distribution of the pairs is consistent with the unevenness in the overall distribution of protein structures across the SCOP classes.
Leave-one-out cross validation of two SVM models.
| | Model 1 | | | Model 2 | | |
| Kernel | Acc | Pre | Rec | Acc | Pre | Rec |
| RBF | 92.6% | 93.7% | 93.7% | 77.4% | 74.6% | 64.1% |
| Polynomial | 92.0% | 92.8% | 93.7% | 76.5% | 70.1% | 69.6% |
Model 1 is trained on the two positive subsets and one negative subset. Model 2 is trained using the same positive set and a negative set that combines both negative subsets. Accuracy (Acc), precision (Pre), and recall (Rec) were calculated for both kernels, RBF and Polynomial.
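The evaluation protocol in this table, leave-one-out cross-validation scored by accuracy, precision, and recall, can be sketched as follows. To keep the example self-contained, a 1-nearest-neighbour rule stands in for the classifier; it is explicitly not the SVM used in the paper, and all function names and the toy data are hypothetical.

```python
def leave_one_out(samples, labels, fit_predict):
    """Predict every sample with a model built from all the other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, samples[i]))
    return predictions

def metrics(labels, predictions, positive=1):
    """Return (accuracy, precision, recall) for a binary problem."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

def nearest_neighbour(train_x, train_y, x):
    """Stand-in classifier (1-NN). NOT the SVM evaluated in the table."""
    def sq_dist(j):
        return sum((a - b) ** 2 for a, b in zip(train_x[j], x))
    return train_y[min(range(len(train_x)), key=sq_dist)]

# Toy 1-D data: label 1 = similar pair, label 0 = dissimilar pair.
samples = [[0.0], [0.1], [0.2], [1.0], [1.1], [0.35]]
labels = [0, 0, 0, 1, 1, 1]

predictions = leave_one_out(samples, labels, nearest_neighbour)
accuracy, precision, recall = metrics(labels, predictions)
print(predictions)  # [0, 0, 0, 1, 1, 0]
print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.833 1.0 0.667
```

In the toy data the last positive sample sits closer to the negative group, so it is misclassified when held out. That lowers recall while leaving precision at 1.0, the same kind of precision/recall gap the table reports for the second model.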
Top 20 ranked features for both SVM models.
| Model No. 1 | | Model No. 2 | |
| Feature ID | Description of features | Feature ID | Description of features |
| 105 | difference of number of contacts between two interfaces | 105 | difference of number of contacts between two interfaces |
| 29 | ASA of first interface | 30 | planarity of first interface |
| 81 | ASA of second interface | 64 | number of Aromatic-Hydrophobic contacts in the second interface |
| 30 | planarity of first interface | 76 | number of Small-Hydrophobic contacts in the second interface |
| 64 | number of Aromatic-Hydrophobic contacts in the second interface | 29 | ASA of first interface |
| 53 | number of Aliphatic-Aliphatic contacts in the second interface | 83 | protrusion of the second interface |
| 71 | number of Negative-Negative contacts in the second interface | 21 | number of Negative-Hydrophobic contacts in the first interface |
| 82 | planarity of second interface | 44 | ratio of Asn hotspots in the first interface |
| 28 | number of Polar-Polar contacts in the first interface | 16 | number of Positive-Small contacts in the first interface |
| 69 | number of Positive-Hydrophobic contacts in the second interface | 34 | ratio of Cys hotspots in the first interface |
| 86 | ratio of Cys hotspots in the second interface | 50 | ratio of Ile hotspots in the first interface |
| 92 | ratio of Phe hotspots in the second interface | 73 | number of Negative-Hydrophobic contacts in the second interface |
| 90 | ratio of Tyr hotspots in the second interface | 106 | difference of ASA between two interfaces |
| 73 | number of Negative-Hydrophobic contacts in the second interface | 19 | number of Negative-Negative contacts in the first interface |
| 74 | number of Negative-Polar contacts in the second interface | 11 | number of Aromatic-Small contacts in the first interface |
| 62 | number of Aromatic-Negative contacts in the second interface | 100 | ratio of Thr hotspots in the second interface |
| 67 | number of Positive-Negative contacts in the second interface | 68 | number of Positive-Small contacts in the second interface |
| 58 | number of Aliphatic-Hydrophobic contacts in the second interface | 98 | ratio of Glu hotspots in the second interface |
| 97 | ratio of Lys hotspots in the second interface | 39 | ratio of Gln hotspots in the first interface |
| 56 | number of Aliphatic-Negative contacts in the second interface | 33 | ratio of Trp hotspots in the first interface |
The ranking was obtained using the SVM attribute evaluation protocol implemented in the Weka software package.
Minimum, Maximum, and Median of feature values for top 20 ranked features for both SVM models.
| Model No. 1 | | | | | | | Model No. 2 | | | | | | |
| | Positive set | | | Negative set | | | | Positive set | | | Negative set | | |
| ID | Min | Max | Med | Min | Max | Med | ID | Min | Max | Med | Min | Max | Med |
| 105 | 0.00 | 328.00 | 35.00 | 2.00 | 732.00 | 130.00 | 105 | 0.00 | 328.00 | 35.00 | 0.00 | 732.00 | 103.00 |
| 29 | 35.40 | 146.80 | 51.20 | 31.90 | 168.40 | 69.90 | 30 | 1.48 | 8.16 | 4.59 | 0.48 | 12.90 | 4.15 |
| 81 | 0.00 | 0.23 | 0.11 | 0.05 | 0.19 | 0.11 | 64 | 0.00 | 0.14 | 0.03 | 0.00 | 0.37 | 0.02 |
| 30 | 0.00 | 0.36 | 0.09 | 0.00 | 1.00 | 0.13 | 76 | 0.00 | 0.50 | 0.08 | 0.00 | 0.20 | 0.08 |
| 64 | 0.00 | 0.14 | 0.03 | 0.00 | 0.09 | 0.02 | 29 | 35.30 | 146.80 | 51.10 | 31.90 | 168.40 | 61.30 |
| 53 | 0.00 | 0.09 | 0.01 | 0.00 | 0.12 | 0.01 | 83 | 0.00 | 55.40 | 4.49 | 0.00 | 55.40 | 4.49 |
| 71 | 0.00 | 0.09 | 0.01 | 0.00 | 0.04 | 0.01 | 21 | 0.00 | 0.09 | 0.01 | 0.00 | 0.33 | 0.02 |
| 82 | 1.08 | 9.03 | 4.52 | 3.60 | 8.45 | 5.13 | 44 | 0.00 | 0.33 | 0.04 | 0.00 | 1.00 | 0.03 |
| 28 | 0.00 | 0.36 | 0.09 | 0.00 | 1.00 | 0.13 | 16 | 0.00 | 0.15 | 0.02 | 0.00 | 0.30 | 0.02 |
| 69 | 0.00 | 0.08 | 0.01 | 0.00 | 0.05 | 0.01 | 34 | 0.00 | 0.27 | 0.00 | 0.00 | 0.33 | 0.00 |
| 86 | 0.00 | 0.27 | 0.00 | 0.00 | 0.19 | 0.00 | 50 | 0.00 | 0.33 | 0.06 | 0.00 | 1.00 | 0.04 |
| 92 | 0.00 | 0.37 | 0.05 | 0.00 | 0.18 | 0.04 | 73 | 0.00 | 0.17 | 0.01 | 0.00 | 0.28 | 0.02 |
| 90 | 0.00 | 1.00 | 0.07 | 0.00 | 0.31 | 0.07 | 106 | 0.01 | 84.40 | 6.37 | 0.01 | 122.10 | 17.90 |
| 73 | 0.00 | 0.17 | 0.01 | 0.00 | 0.09 | 0.02 | 19 | 0.00 | 0.05 | 0.00 | 0.00 | 0.12 | 0.00 |
| 74 | 0.00 | 0.16 | 0.02 | 0.00 | 0.09 | 0.02 | 11 | 0.00 | 0.17 | 0.02 | 0.00 | 0.20 | 0.01 |
| 62 | 0.00 | 0.07 | 0.00 | 0.00 | 0.04 | 0.01 | 100 | 0.00 | 0.33 | 0.06 | 0.00 | 0.50 | 0.06 |
| 67 | 0.00 | 0.08 | 0.01 | 0.00 | 0.05 | 0.01 | 68 | 0.00 | 0.19 | 0.02 | 0.00 | 0.22 | 0.02 |
| 58 | 0.00 | 0.27 | 0.04 | 0.00 | 0.14 | 0.03 | 98 | 0.00 | 0.55 | 0.06 | 0.00 | 0.50 | 0.07 |
| 97 | 0.00 | 0.33 | 0.05 | 0.00 | 0.30 | 0.08 | 39 | 0.00 | 0.28 | 0.04 | 0.00 | 1.00 | 0.03 |
| 56 | 0.00 | 0.07 | 0.00 | 0.00 | 0.05 | 0.01 | 33 | 0.00 | 0.25 | 0.00 | 0.00 | 0.50 | 0.00 |
For each of the top 20 ranked features (ID stands for the feature ID), the minimum (Min), maximum (Max), and median (Med) values were individually calculated for the positive and negative sets.
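The per-feature summary described in this caption amounts to grouping feature values by set and taking the minimum, maximum, and median. A minimal sketch follows; the function names, the label strings, and the toy vectors are hypothetical, and the mapping from a 1-based feature ID to a 0-based column index is an assumption about how the IDs are numbered.

```python
from statistics import median

def summarize(values):
    """Minimum, maximum, and median of a non-empty list of numbers."""
    return {"min": min(values), "max": max(values), "med": median(values)}

def summarize_feature(vectors, labels, feature_id):
    """Summarize one feature separately for each label.

    Assumption: feature IDs are 1-based, so ID k is column k - 1.
    """
    column_index = feature_id - 1
    summary = {}
    for label in sorted(set(labels)):
        column = [v[column_index] for v, y in zip(vectors, labels) if y == label]
        summary[label] = summarize(column)
    return summary

# Toy 2-feature vectors; the vectors described in this record have 106 features.
vectors = [[35.0, 1.5], [20.0, 4.5], [50.0, 8.0], [130.0, 0.5], [90.0, 12.0]]
labels = ["positive", "positive", "positive", "negative", "negative"]

result = summarize_feature(vectors, labels, feature_id=1)
print(result["positive"])  # {'min': 20.0, 'max': 50.0, 'med': 35.0}
print(result["negative"])  # {'min': 90.0, 'max': 130.0, 'med': 110.0}
```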
Comparison of SCOPPI and PRISM with Model 1 and Model 2.
| Dataset | Classified | SCOPPI | PRISM | Model 1 | Model 2 |
| H-level | Similar | 48.0% | 15.9% | 98.1% | 75.0% |
| | Dissimilar | 51.0% | 3.2% | 1.88% | 25.0% |
| | Unknown | 1.0% | 80.9% | 0.0% | 0.0% |
| Dissimilar native-native | Similar | 0.0% | 0.0% | - | 33.6% |
| | Dissimilar | 98.1% | 6.6% | - | 66.4% |
| | Unknown | 1.9% | 93.4% | - | 0.0% |
The accuracies for each classifier were calculated using homologous interfaces from the positive set and dissimilar native-native interfaces from the negative sets. The results for Model 1 and Model 2 were based on the leave-one-out cross-validation. Unknown classification results refer to the percentage of those interface pairs that were not classified by either SCOPPI or PRISM.
Figure 6. Average silhouette value against the number of clusters (K).
An obvious knee point (K = 140) is selected as the number of clusters.
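The quantity plotted in this figure, the average silhouette value of a clustering, can be computed as in the sketch below. It uses the standard silhouette definition with Euclidean distance between points; the clustering algorithm and the distance actually used to produce the figure are not stated in this record, so both are assumptions here, and the function name and toy data are hypothetical.

```python
import math

def average_silhouette(points, labels):
    """Average silhouette value of a clustering (standard definition).

    For each point: a = mean distance to the other members of its own
    cluster, b = smallest mean distance to the members of any other
    cluster, s = (b - a) / max(a, b). Singleton clusters contribute 0.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def dist(i, j):
        return math.dist(points[i], points[j])

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # silhouette of a singleton is defined as 0
        a = sum(dist(i, j) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(i, j) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        if denominator > 0.0:
            total += (b - a) / denominator
    return total / len(points)

points = [[0.0], [1.0], [10.0], [11.0]]
good = average_silhouette(points, [0, 0, 1, 1])  # compact, well separated
bad = average_silhouette(points, [0, 1, 0, 1])   # clusters interleaved
print(round(good, 4), good > bad)
```

Evaluating this value for clusterings obtained with different K and plotting it against K gives a curve like the one in the figure, from which a knee point can then be read off.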
A three-level hierarchy obtained using the new feature-based interface similarity measure.
| Level | Clusters | Avg | Min | Max | 1-member |
| H | 2,085 | 1.4 | 1 | 9 | 1,610 |
| C | 1,892 | 1.5 | 1 | 13 | 1,363 |
| A | 140 | 20.0 | 3 | 83 | 0 |
For each level, the number of clusters (Clusters), the average, minimum, and maximum numbers of members per cluster (Avg, Min, and Max), and the number of clusters with one member (1-member) were calculated.
Figure 7. Case studies of similar interactions.
(A) H-level interactions (iiRMSD = 2.93 Å), (B) C-level interactions (iiRMSD = 6.12 Å), and (C) A-level interactions (iiRMSD = 6.19 Å). Subunits from the first interaction together with the corresponding interface and binding sites are colored gold and light yellow. Subunits from the second interaction (and their interfaces and binding sites) are colored dark and light grey. Positively and negatively charged residues in the first interaction are colored blue and red, while in the second interaction they are colored cyan and magenta, respectively. Superposition refers to the superposed interactions, interfaces, and binding sites.