Kyu-il Cho, Dongsup Kim, Doheon Lee.
Abstract
Identifying features that effectively represent the energetic contribution of an individual interface residue to the interactions between proteins remains problematic. Here, we present several new features and show that they are more effective than conventional features. By combining the proposed features with conventional features, we develop a predictive model for interaction hot spots. Initially, 54 multifaceted features, composed of different levels of information including structure, sequence and molecular interaction information, are quantified. Then, to identify the best subset of features for predicting hot spots, feature selection is performed using a decision tree. Based on the selected features, a predictive model for hot spots is created using a support vector machine (SVM) and tested on an independent test set. Our model shows better overall predictive accuracy than previous methods such as the alanine scanning methods Robetta and FOLDEF, and the knowledge-based method KFC. Subsequent analysis yields several findings about hot spots. As expected, hot spots have a larger relative surface area burial and are more hydrophobic than other residues. Unexpectedly, however, residue conservation displays a rather complicated tendency depending on the types of protein complexes, indicating that this feature is not good for identifying hot spots. Of the selected features, the weighted atomic packing density, relative surface area burial and weighted hydrophobicity are the top 3, with the weighted atomic packing density proving to be the most effective feature for predicting hot spots. Notably, we find that hot spots are closely related to π-related interactions, especially π···π interactions.
Year: 2009 PMID: 19273533 PMCID: PMC2677884 DOI: 10.1093/nar/gkp132
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
The 17 protein–protein complexes analyzed
| PDB id | First molecule | Second molecule |
|---|---|---|
| 1a4y | RNase inhibitor | Angiogenin |
| 1a22 | Human growth hormone | Human growth hormone binding protein |
| 1ahw | Immunoglobulin Fab5G9 | Tissue factor |
| 1brs | Barnase | Barstar |
| 1bxi | Colicin E9 Immunity Im9 | Colicin E9 DNase |
| 1cbw | BPTI Trypsin inhibitor | Chymotrypsin |
| 1dan | Blood coagulation factor VIIA | Tissue factor |
| 1dvf | Idiotopic antibody FV D1.3 | Anti-idiotopic antibody FV E5.2 |
| 1f47 | Cell division protein ZIPA | Cell division protein FTSZ |
| 1fc2 | Fc fragment | Fragment B of protein A |
| 1fcc | Fc (IGG1) | Protein G |
| 1gc1 | Envelope protein GP120 | CD4 |
| 1jrh | Antibody A6 | Interferon-gamma receptor |
| 1nmb | N9 Neuraminidase | Fab NC10 |
| 1vfb | Mouse monoclonal antibody D1.3 | Hen egg lysozyme |
| 2ptc | BPTI | Trypsin |
| 3hfm | Hen egg lysozyme | Ig Fab fragment HyHEL-10 |
Figure 1. Decision tree analyses for the two training sets, T1 (a) and T2 (b). The trees show that hot spots can be modeled using only 12 features according to their corresponding training sets, although the constituent members of the T1- and T2-derived sets differ slightly. In both feature sets, newly proposed features such as the weighted atomic packing density, relative surface area burial and weighted hydrophobicity are located in the upper level of nodes in the decision tree.
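To illustrate why an informative feature ends up near the root of a decision tree, the following minimal sketch scores each feature by the best Gini impurity reduction that a single threshold split can achieve. This excerpt does not state which tree algorithm or split criterion was used, so the Gini criterion, the function names and the toy feature values are all assumptions made for illustration only.

```python
from typing import Dict, List, Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of binary labels (1 = hot spot, 0 = other residue)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_gain(values: Sequence[float], labels: Sequence[int]) -> float:
    """Largest impurity reduction over all single-threshold splits."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # The largest value is skipped: splitting there leaves the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def rank_features(features: Dict[str, List[float]], labels: List[int]) -> List[str]:
    """Feature names ordered from most to least informative root split."""
    return sorted(
        features,
        key=lambda name: best_split_gain(features[name], labels),
        reverse=True,
    )


# Hypothetical toy data: six residues and two made-up features.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "packing_density": [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],  # separates the classes
    "noise": [0.5, 0.1, 0.9, 0.4, 0.8, 0.2],            # unrelated to the label
}
ranking = rank_features(features, labels)
print(ranking)
```

A full tree repeats this search recursively on each child node; features that are never chosen for a split are the ones a tree-based selection would discard.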
The classification confusion matrices based on the corresponding training sets (T1, T2)
| Model | Predicted class | T1 training set: T^d | T1 training set: F^e | T1 training set: Total | T2 training set: T | T2 training set: F | T2 training set: Total |
|---|---|---|---|---|---|---|---|
| MINERVA^a | P^b | 70 | 24 | 94 | 38 | 14 | 52 |
| | N^c | 49 | 122 | 171 | 27 | 186 | 213 |
| | Total | 119 | 146 | 265 | 65 | 200 | 265 |
| Robetta | P | 74 | 38 | 112 | 32 | 20 | 52 |
| | N | 45 | 108 | 153 | 33 | 180 | 213 |
| | Total | 119 | 146 | 265 | 65 | 200 | 265 |
| FOLDEF | P | 57 | 20 | 77 | 19 | 15 | 34 |
| | N | 62 | 126 | 188 | 46 | 185 | 231 |
| | Total | 119 | 146 | 265 | 65 | 200 | 265 |
| KFC | P | na^f | na | na | 36 | 26 | 62 |
| | N | na | na | na | 29 | 174 | 203 |
| | Total | na | na | na | 65 | 200 | 265 |
^a MINERVA, an acronym of MINE Residue VAlue, is the name of our model.
^b Positive.
^c Negative.
^d True.
^e False.
^f KFC is only designed to predict hot spots with ΔΔG ≥ 2 kcal/mol, so it is not included in the analysis for the T1 set.
Each column represents the gold standard (the true condition), and each row represents the class predicted by the model.
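As a minimal sketch of how such a matrix translates into the usual performance measures, the code below computes sensitivity, specificity, PPV, NPV and the F1 score from the four cell counts. The class and method names are hypothetical; the counts are the MINERVA entries for the T2 training set, read with rows as predictions and columns as the gold standard. Values obtained this way need not coincide with the rounded figures reported for other evaluation protocols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # predicted positive, true hot spot
    fp: int  # predicted positive, not a hot spot
    fn: int  # predicted negative, true hot spot
    tn: int  # predicted negative, not a hot spot

    def sensitivity(self) -> float:
        """Recall: fraction of true hot spots that are predicted."""
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        """Fraction of non-hot-spot residues predicted as negative."""
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        """Precision: fraction of positive predictions that are correct."""
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        """Fraction of negative predictions that are correct."""
        return self.tn / (self.tn + self.fn)

    def f1(self) -> float:
        """Harmonic mean of precision and recall."""
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


# MINERVA, T2 training set: P row = (38, 14), N row = (27, 186).
minerva_t2 = ConfusionMatrix(tp=38, fp=14, fn=27, tn=186)
for name in ("sensitivity", "specificity", "ppv", "npv", "f1"):
    print(f"{name}: {getattr(minerva_t2, name)():.2f}")
```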
Evaluation of the hot spot prediction with T2 using 10-fold cross-validation
| Metric | KFC | Robetta | FOLDEF | MINERVA2^e,f |
|---|---|---|---|---|
| SN^a | 0.55 | 0.49 | 0.32 | 0.58 |
| SP^b | 0.85 | 0.90 | 0.93 | 0.89 |
| PPV^c | 0.58 | 0.62 | 0.59 | 0.73 |
| NPV^d | 0.88 | 0.84 | 0.81 | 0.87 |
| F1 score | 0.56 | 0.55 | 0.41 | 0.65 |
| ΔF1 | ** | – | – | 0.09 |
| P-value | ** | – | – | 0.01 |
^a Sensitivity (recall).
^b Specificity.
^c Positive predictive value (precision).
^d Negative predictive value.
^e The performance of our model using the 2 kcal/mol training set, T2.
^f MINERVA, an acronym of MINE Residue VAlue, is the name of our model.
Evaluation of the hot spot prediction for each model with the independent test set
| Metric | Robetta | KFC | FOLDEF | MINERVA2^e | MINERVA1^f |
|---|---|---|---|---|---|
| SN^a | 0.33 | 0.31 | 0.26 | 0.44 | 0.62 |
| SP^b | 0.87 | 0.85 | 0.88 | 0.90 | 0.76 |
| PPV^c | 0.52 | 0.48 | 0.48 | 0.65 | 0.53 |
| NPV^d | 0.73 | 0.74 | 0.73 | 0.78 | 0.82 |
| F1 score | 0.40 | 0.37 | 0.34 | 0.52 | 0.57 |
| ΔF1 | ** | – | – | 0.12 | 0.17 |
| P-value | ** | – | – | 4.31 × 10⁻³ | 4.55 × 10⁻⁴ |
^a Sensitivity (or recall).
^b Specificity.
^c Positive predictive value (or precision).
^d Negative predictive value.
^e The performance of our model trained with the 2 kcal/mol training set, T2.
^f The performance of our model trained with the 1 kcal/mol training set, T1.
MINERVA, an acronym of MINE Residue VAlue, is the name of our model.
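The F1 and ΔF1 rows can be checked against the SN and PPV rows with a few lines of arithmetic. The sketch below recomputes F1 as the harmonic mean of precision and recall. ΔF1 is not defined in this excerpt, so the code assumes it is the gain over the best-scoring existing method, an interpretation that is consistent with the tabulated 0.12 and 0.17. Because the inputs are already rounded to two decimals, a recomputed F1 can differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision (PPV) and recall (SN)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (SN, PPV) pairs copied from the independent-test-set table.
table = {
    "Robetta": (0.33, 0.52),
    "KFC": (0.31, 0.48),
    "FOLDEF": (0.26, 0.48),
    "MINERVA2": (0.44, 0.65),
    "MINERVA1": (0.62, 0.53),
}
f1 = {name: f1_score(ppv, sn) for name, (sn, ppv) in table.items()}

# Assumption: the reference for delta-F1 is the best existing method.
existing = ("Robetta", "KFC", "FOLDEF")
baseline = max(existing, key=lambda name: f1[name])
delta = {name: f1[name] - f1[baseline] for name in ("MINERVA2", "MINERVA1")}

for name, value in f1.items():
    print(f"{name}: F1 = {value:.2f}")
print(f"baseline: {baseline}")
for name, value in delta.items():
    print(f"{name}: delta F1 = {value:.2f}")
```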
Figure 2. (a) Weighted atomic packing density in the bound state. (b) Coordination number in the bound state. The weighted atomic packing density is compared with the coordination number using histograms. The average value of the weighted atomic packing density for hot spots is quite different from that for other residues, irrespective of the ΔΔG cutoff value. In contrast, the coordination number does not differ between hot spots and other residues. This result is supported by statistical analysis (Table 5).
Weighted atomic packing density versus coordination number
| Density type | ΔΔG cutoff value (kcal/mol) | Mann–Whitney U-test, P-value | Hot spots^c/Others^d |
|---|---|---|---|
| W^a | 1.0 | 3.32 × 10⁻¹¹ | 119/146 |
| W | 2.0 | 2.22 × 10⁻¹² | 65/200 |
| CN^b | 1.0 | 0.10 | 119/146 |
| CN | 2.0 | 0.02 | 65/200 |
^a Weighted atomic packing density in the bound state.
^b Coordination number in the bound state, defined as the number of Cα atoms within 6.5 Å around each residue (42).
^c Number of hot spots.
^d Number of energetically unimportant residues.
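The coordination number defined in the footnote can be sketched directly from Cα coordinates. The version below makes three assumptions that the footnote leaves open: distances are measured from the residue's own Cα atom, the residue itself is not counted, and the 6.5 Å boundary is inclusive. The function name and the toy coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def coordination_numbers(ca_coords: Sequence[Point], cutoff: float = 6.5) -> List[int]:
    """For each residue, count the other C-alpha atoms within `cutoff` angstroms."""
    counts = []
    for i, centre in enumerate(ca_coords):
        neighbours = sum(
            1
            for j, other in enumerate(ca_coords)
            if j != i and math.dist(centre, other) <= cutoff
        )
        counts.append(neighbours)
    return counts


# Hypothetical C-alpha positions (angstroms): three atoms spaced 3.8 apart
# along the x axis, plus one isolated atom far from the others.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(coordination_numbers(ca))
```

The double loop is quadratic in the number of residues, which is adequate for a single interface; a spatial index would be the usual choice for whole-proteome scans.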
Structural comparison between the unbound and bound states for various proteins using combinatorial extension
| Protein | Bound state^a: PDB id | Bound state: Chain id | Unbound state: PDB id | Unbound state: Chain id | RMSD^c (Å) | Seq. identity (%) |
|---|---|---|---|---|---|---|
| Angiogenin | 1a4y | B | 1un3 | A | 0.76 | 99.1 |
| hGH | 1a22 | A | 1hgu | – | 2.68 | 68.4 |
| Tissue factor | 1ahw | C | 1tfh | A | 1.39 | 100.0 |
| Barnase | 1brs | A | 1bnf | A | 1.12 | 98.1 |
| Barstar | 1brs | D | 1a19 | A | 0.44 | 98.9 |
| BPTI | 1cbw | D | 1bpt | – | 0.39 | 98.2 |
| Tissue factor | 1dan | T | 1tfh | A | 0.63 | 100.0 |
| RNase inhibitor | 1dfj | I | 2bnh | – | 1.50 | 100.0 |
| CD4 | 1gc1 | C | 1cdj | A | 1.09 | 100.0 |
| Hen egg lysozyme | 1vfb | C | 1lyz | – | 1.11 | 100.0 |
| Trypsin | 2ptc | I | 1bpt | – | 0.36 | 98.2 |
| Lysozyme | 3hfm | Y | 1lyz | – | 0.67 | 100.0 |
^a A protein is in the bound state.
^b These 12 proteins are used for statistical analysis to show that hot spots already have a densely structured organization in the unbound state.
^c Root mean square deviation.
^d '–' indicates that no chain id is given in the PDB entry.
^e 1ahw C and 1dan T are the same protein but have different binding sites; likewise, 1vfb C and 3hfm Y have different binding sites.
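For reference, the RMSD column is the root mean square deviation over paired atom positions. The sketch below computes it for two coordinate sets that are assumed to be already aligned and superimposed; it does not implement the combinatorial extension alignment itself, and the function name and toy coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root mean square deviation between paired, pre-superimposed atoms."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


# Hypothetical example: every atom is displaced by 1 angstrom along z.
bound = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
unbound = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(bound, unbound))
```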
Figure 3. (a) The weighted atomic packing density of the hot spots in the unbound state is much higher than that of the rest of the interface. (b) The hot spots are much denser than other residues in the bound state, irrespective of the ΔΔG cutoff value. (c) The difference in weighted atomic packing density between the unbound and bound states (ΔWAD) is large.
P-values for comparisons of distributions of weighted atom densities for energetically different residue types
| Density type | ΔΔG cutoff value (kcal/mol) | Mann–Whitney U-test, P-value | Hot spots^d/Others^e |
|---|---|---|---|
| W (bound)^a | 2.0 | 4.24 × 10⁻¹¹ | 25/107 |
| | 1.0 | 3.93 × 10⁻¹¹ | 49/83 |
| W (unbound)^b | 2.0 | 5.79 × 10⁻¹³ | 25/107 |
| | 1.0 | 2.21 × 10⁻⁸ | 49/83 |
| ΔW^c | 2.0 | 1.88 × 10⁻⁸ | 25/107 |
| | 1.0 | 3.34 × 10⁻⁴ | 49/83 |
^a W (bound): weighted atomic packing density in the bound form.
^b W (unbound): weighted atomic packing density in the unbound form.
^c ΔW: the difference in weighted atomic packing density before and after binding.
^d Number of hot spots in the data set.
^e Number of energetically unimportant residues in the data set.
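The P-values above come from Mann–Whitney U-tests on the densities of hot spots versus other residues. A minimal sketch of such a test follows, using average ranks for ties and a two-sided normal approximation without tie or continuity correction. This excerpt does not say which variant of the test was applied, so these choices, the function names and the toy densities are assumptions. The normal approximation is rough for samples as small as the toy example.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U, two-sided p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p


# Hypothetical densities: hot spots are uniformly denser than other residues.
hot_spots = [0.9, 0.8, 0.7, 0.6]
others = [0.5, 0.4, 0.3, 0.2, 0.1]
u_stat, p_value = mann_whitney_u(hot_spots, others)
print(u_stat, p_value)
```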
Figure 4. The interactions between densely packed hot spots in 3hfm. D32, Y33 and Y53 in chain H (HyHEL-10) interact with L75, K97 and D101 in chain Y (lysozyme). Y53 and Y58 in chain H interact with Y96 in chain L (HyHEL-10) and R21 in chain Y. Y20 and K96 in chain Y interact with N31, N32, Y50 and Q53 in chain L. The blue spheres represent the densely packed hot spots in chain L (HyHEL-10), and the yellow spheres indicate the densely packed hot spots in chain H (HyHEL-10). The green spheres are the densely packed hot spots in chain Y (hen egg lysozyme), and the white sticks represent the energetically unimportant residues. The images were created with PyMOL (59).
Figure 5. (a) ΔΔG 1.0 kcal/mol cutoff value. (b) ΔΔG 2.0 kcal/mol cutoff value. Relative ratio between hot spot residues and energetically unimportant residues as a function of the weighted hydrophobicity. As the hydrophobicity increases, the fraction of residues that are hot spots increases.
P-values from comparisons of distributions of conservation score for energetically different types of residues
| Conservation score | ΔΔG cutoff value (kcal/mol) | Including antibody–antigens: Mann–Whitney U-test | Including antibody–antigens: Hot spots^a/Others^b | Excluding antibody–antigens: Mann–Whitney U-test | Excluding antibody–antigens: Hot spots/Others |
|---|---|---|---|---|---|
| VNE | 1.0 | 0.79 | 119/146 | 5.10 × 10⁻³ | 56/105 |
| | 2.0 | 0.05 | 65/200 | 5.36 × 10⁻⁴ | 32/129 |
^a Number of hot spots.
^b Number of energetically unimportant residues.