Naoshi Fukuhara, Nobuhiro Go, Takeshi Kawabata.
Abstract
Protein-protein interactions support most biological processes, and it is important to find specifically interacting partner proteins among homologous proteins in order to elucidate cellular functions such as signal transduction systems. Various high-throughput experimental methods for identifying these interactions have been developed and used to generate a huge amount of data. Because these experiments have been applied to only a few organisms, and their accuracy is believed to be limited, it would be valuable to develop computational methods for predicting protein-protein interactions from amino acid sequences or tertiary structural information. In this study, we describe a method for predicting interacting proteins based on homology-modeled complex structures. We employed the statistical residue-residue contact energy used in a previous study, and two new types of score: a simple electrostatic energy and the sequence similarity between target sequences and template structures. The validity of each protein-protein complex model was assessed using these scores singly and in combination. We applied our method to all the protein heterodimers of Saccharomyces cerevisiae. To evaluate the prediction performance of our method, we prepared two protein-protein interaction datasets: a complete dataset and a high confidence dataset. The complete dataset (10,325 protein dimer models) contains all the yeast protein heterodimers whose complex structures can be modeled. Among them, pairs registered in the DIP database are defined as interacting pairs, and those not registered are defined as non-interacting protein pairs. The high confidence dataset (3,219 protein dimer models) is a more reliable subset of the complete dataset, extracted using the criterion of common subcellular localization.
Both datasets show that sequence similarity has a much higher discrimination power than the other structure-based scores, but that the inclusion of contact energy results in significant improvement over predictions using sequence similarity alone. These results suggest that the sequence similarity is indispensable for the prediction, whereas structure scores can play supporting roles.
Keywords: binding specificity; contact energy; homology modeling; protein-protein interaction; sequence similarity
Year: 2007 PMID: 27857563 PMCID: PMC5036659 DOI: 10.2142/biophysics.3.13
Source DB: PubMed Journal: Biophysics (Nagoya-shi) ISSN: 1349-2942
Figure 1. Residue-residue statistical contact energy in protein-protein interfaces. On the horizontal and vertical axes, the 20 amino acids are arranged in descending order of hydrophobicity. Energy values are represented from red (low energy) to blue (high energy).
Atoms of amino acids where charges can be assigned
| Residue | Atom | Residue | Atom | Residue | Atom |
|---|---|---|---|---|---|
| GLU | OE1 | TRP | CE3 | SER | OG |
| GLU | OE2 | TYR | OH | ILE | CD1 |
| ASP | OD1 | PHE | CZ | MET | CE |
| ASP | OD2 | GLN | OE1 | LEU | CD1 |
| ARG | NH1 | GLN | NE2 | LEU | CD2 |
| ARG | NH2 | ASN | OD1 | VAL | CG1 |
| LYS | NZ | ASN | ND2 | VAL | CG2 |
| HIS | ND1 | CYS | SG | ALA | CB |
| HIS | NE2 | THR | OG1 | PRO | CG |
| TRP | CE2 | THR | CG2 | GLY | CA |
PDB atomic names are shown. These atoms are mainly taken from Shaul and Schreiber’s charge rules (Shaul and Schreiber, 2005). Atoms of proline and glycine have been added; OXT and the N-terminus atom have been removed.
The classification of interacting and non-interacting protein pairs included in the complete dataset by subcellular localization
| | Subcellular localization | Interacting pairs | Non-interacting pairs |
|---|---|---|---|
| (i) | Two proteins share at least one common localized compartment | | 5,631 |
| (ii) | Subcellular localization of at least one protein is unknown | 10 | 1,438 |
| (iii) | Two proteins do not share any localized compartments | 27 | |
| Total | | | |
The underlined numbers are for the complete dataset; bold numbers are for the high confidence dataset.
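The three localization classes in the table above can be expressed as a small rule. The following is a minimal sketch, not the authors' implementation: the function name `localization_class` and the representation of a protein's localization as a set of compartment names (with `None` or an empty set meaning "unknown") are assumptions made for illustration.

```python
from typing import Optional, Set


def localization_class(loc_a: Optional[Set[str]], loc_b: Optional[Set[str]]) -> str:
    """Assign a protein pair to class 'i', 'ii' or 'iii' of the table.

    Hypothetical helper: each argument is the set of compartments annotated
    for one protein; None or an empty set means the localization is unknown.
    """
    # (ii) the localization of at least one protein is unknown
    if not loc_a or not loc_b:
        return "ii"
    # (i) the two proteins share at least one compartment
    if loc_a & loc_b:
        return "i"
    # (iii) both are annotated but share no compartment
    return "iii"


if __name__ == "__main__":
    print(localization_class({"nucleus"}, {"nucleus", "cytoplasm"}))  # class i
    print(localization_class({"nucleus"}, None))                      # class ii
    print(localization_class({"nucleus"}, {"mitochondrion"}))         # class iii
```

Checking the unknown case first keeps class (ii) from being confused with class (iii), since an empty annotation set trivially shares nothing with any other set.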
Figure 2. The protein-protein interaction network of the interacting and non-interacting protein pairs included in the complete dataset. The graph was visualized with Cytoscape [50]. The nodes correspond to the target proteins; edges correspond to interactions. The interacting protein pairs are shown in red, the non-interacting ones in blue. Proteins that include the domains of protein kinase catalytic subunit, WD40-repeat, G proteins, canonical RBD, ankyrin repeat, and cyclin are colored green, cyan, red, yellow, gray and black, respectively. If the target protein includes more than two domains from the six types of domains, the node is colored according to the domain nearest to the N-terminus. SCOP, the structural classification database of proteins, was used to identify the domains [51].
Family pairs frequently appearing in template complexes
| Family pairs of the template structures | PDB | Complete dataset: Inter | Complete dataset: Non-inter | High confidence dataset: Inter | High confidence dataset: Non-inter |
|---|---|---|---|---|---|
| Top 10 family pairs of the interacting protein pairs | | | | | |
| 1. b.38.1.1/b.38.1.1 | 1b34AB | 33 | 24 | 33 | 0 |
| 2. d.153.1.4/d.153.1.4 | 1g65JK | 30 | 44 | 30 | 0 |
| 3. h.1.15.1/h.1.15.1 | 1gl2BC | 20 | 80 | 10 | 45 |
| 4. c.37.1.20-a.80.1.1/c.37.1.20-a.80.1.1 | 1sxjBC | 19 | 95 | 19 | 15 |
| 5. d.144.1.7/a.74.1.1-a.74.1.1 | 1finAB | 18 | 1662 | 14 | 559 |
| 6. c.3.1.3-d.16.1.6-c.3.1.3/c.37.1.8 | 1ukvGY | 13 | 61 | 12 | 13 |
| 7. d.144.1.7/d.211.1.1 | 1bi7AB | 10 | 1912 | 10 | 381 |
| 8. a.22.1.1/a.22.1.1 | 1id3AF | 9 | 12 | 9 | 0 |
| 9. i.1.1.1/i.1.1.1 | 1s1hJN | 8 | 16 | 8 | 9 |
| 10. a.116.1.1/c.37.1.8 | 1ow3AB | 6 | 342 | 5 | 99 |
| Top 10 family pairs of the non-interacting protein pairs | | | | | |
| 1. d.144.1.7/d.211.1.1 | 1g3nAB | 10 | 1912 | 10 | 381 |
| 2. a.74.1.1-a.74.1.1/d.144.1.7 | 1oiuBC | 18 | 1662 | 14 | 559 |
| 3. c.37.1.8-a.66.1.1-c.37.1.8/b.69.4.1 | 1gotAB | 1 | 530 | 1 | 321 |
| 4. a.116.1.1/c.37.1.8 | 1ow3AB | 6 | 342 | 5 | 99 |
| 5. j.66.1.1/d.144.1.7 | 1f3mAC | 1 | 319 | 1 | 112 |
| 6. c.10.2.4/d.58.7.1 | 1a9nAB | 2 | 257 | 1 | 109 |
| 7. c.37.1.8/c.10.1.2 | 1k5dAC | 4 | 239 | 3 | 87 |
| 8. c.45.1.1/d.144.1.7 | 1fq1AB | 0 | 204 | 0 | 59 |
| 9. a.48.1.1-a.39.1.7-d.93.1.1-g.44.1.1/d.20.1.1 | 1fbvAC | 3 | 189 | 3 | 38 |
| 10. a.118.1.1/c.37.1.8 | 1qbkBC | 4 | 184 | 4 | 44 |
The SCOP IDs included in the table are as follows: a.22.1.1: Nucleosome core histones, a.39.1.7: EF-hand modules in multidomain proteins, a.48.1.1: N-terminal domain of cbl (N-cbl), a.66.1.1: Transducin (alpha subunit) insertion domain, a.74.1.1: Cyclin, a.80.1.1: DNA polymerase III clamp loader subunits C-terminal domain, a.116.1.1: BCR-homology GTPase activation domain (BH-domain), a.118.1.1: Armadillo repeat, b.38.1.1: Sm motif of small nuclear ribonucleoproteins SNRNP, b.69.4.1: WD40-repeat, c.3.1.3: GDI-like N domain, c.10.1.2: Rna1p (RanGAP1) N-terminal domain, c.10.2.4: U2A′-like, c.37.1.8: G proteins, c.37.1.20: Extended AAA-ATPase domain, c.45.1.1: Dual specificity phosphatase-like, d.16.1.6: GDI-like, d.20.1.1: Ubiquitin conjugating enzyme UBC, d.58.7.1: Canonical RBD, d.93.1.1: SH2 domain, d.144.1.7: Protein kinases catalytic subunit, d.153.1.4: Proteasome subunits, d.211.1.1: Ankyrin repeat, g.44.1.1: RING finger domain C3HC4, h.1.15.1: SNARE fusion complex, i.1.1.1: Ribosome complexes, j.66.1.1: pak1 autoregulatory domain.
PDB: PDB code of the template complexes. Inter: number of interacting protein pairs. Non-inter: number of non-interacting protein pairs.
Figure 3. Distributions of Z-scores of contact energy calculated for the protein pairs included in the complete dataset. Black and gray bars correspond to interacting and non-interacting protein pairs, respectively.
Figure 4. Distributions of Z-scores of electrostatic energy calculated for the protein pairs included in the complete dataset.
Figure 5. Distributions of Z-scores of sequence similarity calculated for the protein pairs included in the complete dataset.
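The Z-scores plotted in these distributions put the three raw scores on a common scale. The sketch below shows the standard transformation z = (x − μ)/σ and one possible way of combining scores. It is an illustration under stated assumptions, not the paper's procedure: the reference population used for μ and σ, the unweighted sum in the hypothetical `combine_z` helper, and any sign convention for energies versus similarities are all assumptions not given in this record.

```python
from statistics import mean, pstdev
from typing import List


def z_scores(values: List[float]) -> List[float]:
    """Standardize raw scores as z = (x - mean) / standard deviation.

    Assumption: mean and deviation are taken over the supplied values.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values identical: no spread, so every Z-score is zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combine_z(*z_lists: List[float]) -> List[float]:
    """Hypothetical combination: unweighted sum of Z-scores per protein pair."""
    return [sum(zs) for zs in zip(*z_lists)]


if __name__ == "__main__":
    contact = z_scores([-12.0, -3.5, 0.8, -7.1])   # made-up raw scores
    similarity = z_scores([55.0, 20.0, 18.0, 40.0])
    print(combine_z(contact, similarity))
```

Because energies are favorable when low and similarities when high, a real combination would need a consistent sign convention before summing; that choice is left open here.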
Figure 6. Recall-precision plots for discrimination between interacting and non-interacting protein pairs using single and combined scores in the complete dataset. “Con”: contact energy, “Ele”: electrostatic energy, “Seq”: sequence similarity. “Ele+Con”, “Seq+Con”, “Seq+Ele” and “Seq+Ele+Con” correspond to the plots using combined Z-scores. The purple triangle shows the performance of the method of Davis et al. [25]
Figure 7. Recall-precision plots for discrimination between interacting and non-interacting protein pairs using single and combined scores in the high confidence dataset. Abbreviations as in Figure 6.
Figure 8. The maximum F-measures with their recall and precision values for each recall-precision plot using single and combined Z-scores in the complete dataset. Abbreviations as in Figure 6. Dotted line: maximum F-measure of sequence similarity alone.
Figure 9. The maximum F-measures with their recall and precision values for each recall-precision plot using single and combined Z-scores in the high confidence dataset. Abbreviations as in Figure 6.
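The maximum F-measure reported for each recall-precision plot can be obtained by sweeping a threshold over the ranked scores. The following is a minimal sketch using the standard definition F = 2PR/(P + R). The function name `max_f_measure`, the convention that a higher score means "more likely interacting", and the rank-by-rank sweep (which ignores tied scores) are assumptions for illustration.

```python
from typing import List, Tuple


def max_f_measure(scores: List[float], labels: List[bool]) -> Tuple[float, float, float]:
    """Return (best F-measure, recall, precision) over all rank cutoffs.

    Assumption: higher score = predicted more likely to interact;
    labels[i] is True for an interacting pair.
    """
    total_pos = sum(labels)
    # Rank pairs from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = (0.0, 0.0, 0.0)
    tp = 0
    for k, (_, is_pos) in enumerate(ranked, start=1):
        tp += is_pos  # the top-k pairs are predicted as interacting
        if tp == 0:
            continue  # precision and recall are both zero; F is undefined
        recall = tp / total_pos
        precision = tp / k
        f = 2 * precision * recall / (precision + recall)
        if f > best[0]:
            best = (f, recall, precision)
    return best


if __name__ == "__main__":
    # Toy example with made-up scores, not data from the paper.
    print(max_f_measure([0.9, 0.8, 0.7, 0.1], [True, False, True, False]))
```

With the large class imbalance between interacting and non-interacting pairs in these datasets, a recall-precision summary of this kind is more informative than overall accuracy, which a trivial "never interacting" predictor would already make high.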