Interolog interfaces in protein-protein docking.

Abstract

Proteins are essential elements of biological systems, and their function typically relies on their ability to successfully bind to specific partners. Recently, an emphasis of study into protein interactions has been on hot spots, or residues in the binding interface that make a significant contribution to the binding energetics. In this study, we investigate how conservation of hot spots can be used to guide docking prediction. We show that the use of evolutionary data combined with hot spot prediction highlights near-native structures across a range of benchmark examples. Our approach explores various strategies for using hot spots and evolutionary data to score protein complexes, using both absolute and chemical definitions of conservation along with refinements to these strategies that look at windowed conservation and filtering to ensure a minimum number of hot spots in each binding partner. Finally, structure-based models of orthologs were generated for comparison with sequence-based scoring. Using two data sets of 22 and 85 examples, a high rate of top 10 and top 1 predictions is observed, with up to 82% of examples returning a top 10 hit and 35% returning a top 1 hit depending on the data set and strategy applied; upon inclusion of the native structure among the decoys, up to 55% of examples yielded a top 1 hit. The 20 common examples between data sets show that more carefully curated interolog data yields better predictions, particularly in achieving top 1 hits. Proteins 2015; 83:1940-1946.

© 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc.

Keywords: hot spot; interolog; molecular evolution; mutagenesis; ortholog; protein-protein docking

Substances：
Proteins

Year: 2015 PMID： 25740680 PMCID： PMC5054918 DOI： 10.1002/prot.24788

Source DB: PubMed Journal: Proteins ISSN： 0887-3585

INTRODUCTION

Protein interactions form the essential language of biochemistry, and the ability of molecules to communicate via binding interactions governs most physiological processes. Predictive models for protein docking help to uncover biological mechanisms, explain disease polymorphisms, and facilitate the design of therapeutic agents. The ability to describe mutagenesis effects at the structural level and connect them to effects on molecular interactions provides a direct connection between structure and systems biology, which is an underexplored area of computational modeling. A number of factors, including but not limited to the size and shape of each protein, electrostatic forces, and hydrophobicity, determine the binding affinity of a protein complex. The binding interface regions are essential to the formation of a protein–protein complex, and as a result they have been shown to be more conserved than other surface‐exposed regions of proteins.1 Within these binding interface regions, there is a subset of residues known as “hot spots.” These residues are believed to make the largest energetic contributions to binding affinity, and thus hot spot interactions are selectively conserved over the course of molecular evolution. Experimental identification of hot spots has been pursued using alanine‐scanning mutagenesis. Recently, there has been an increase in the development of computational methods that simulate these alanine‐scanning experiments. Hot spot prediction algorithms use a variety of approaches, including molecular dynamics simulations, energetic analysis, and machine learning.2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42 These programs have proven to be accurate and effective models for predicting interface hot spots.
In our study, the KFC2 method was used to predict hot spots, as it has proven to be highly predictive in comparison with other models.43 Our goal is to use the evolutionary conservation of important hot spot interactions toward the identification of near‐native protein–protein interfaces generated by low‐resolution exhaustive search algorithms. Proteins sharing a common evolutionary ancestor and similar function in different species are called orthologs. As a protein evolves over time and across species, its amino acid composition may be altered due to mutation, but its function is often maintained due to selection pressures. Here, we explore the use of predictive software to identify hot spots within pairs of interacting orthologs (known as interologs) using the sequence and spatial conservation of hot spots across species to help identify near‐native docking predictions. The application of in silico mutagenesis in docking has often focused on specific cases rather than serving as a general mechanism for scoring docking predictions.44, 45, 46, 47, 48, 49, 50, 51, 52, 53 Some recent work has pursued statistical potentials based on evolutionary data54 and genome‐wide studies of protein interactions based on hot spots,55 while other approaches have used evolutionary data to guide sampling within docking algorithms.56, 57 Our study suggests that structure‐based prediction of hot spots combined with evolutionary conservation is able to identify near‐native predictions for a range of systems.

MATERIALS AND METHODS

To select protein–protein interactions for our study, we used a subset of the Protein Docking Benchmark58 for which reliable information on interologs could be obtained. Data Set 1 was constructed using the 3D‐Interologs database.59 Each complex was checked to ensure that the orthologs were from the same protein family and carried out the same or similar biological function(s). Immune‐system proteins were avoided, as these types of proteins are subject to different evolutionary pressures than other classes of molecules. Furthermore, selected examples needed to have at least four interologs in order to ensure statistically significant results. In the end, based on these parameters we were able to collect data on 22 different protein–protein complexes. Data Set 2 was taken from the article of Andreani et al.54 using multiple sequence alignments extracted from the InterEvol database.60 Using MUSCLE,61 each set of orthologous sequences was aligned with the sequences of the protein structures used to generate docking decoys. We added amino acids missing in the crystal structures at both the alignment and structure refinement stages. A description of examples in Data Sets 1 and 2 can be found in Supporting Information Tables S1 and S2, respectively. The relative distance between the docking decoys and the native conformations was measured using the interface RMSD (iRMSD) value, as reported in the ZDOCK results for the Docking Benchmark Set. Decoys with relatively low interface RMSD values (< 2.5 Å) were considered near‐native hits, while those with higher RMSD values were considered non‐native. For our sample, we used 1,000 decoys selected from the list of 54,000 results produced by ZDOCK using fine sampling. We used the top 100 structures as ranked by ZDOCK along with 900 randomly sampled structures from the remainder.
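
The decoy selection and labeling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the integer decoy identifiers, and the fixed random seed are all assumptions made for the example.

```python
import random

def select_decoys(ranked_ids, n_top=100, n_random=900, seed=0):
    """Keep the n_top best-ranked decoys plus n_random drawn from the remainder.

    ranked_ids must already be ordered by the docking score (best first).
    The seed is an assumption, used only to make the sketch reproducible.
    """
    top = list(ranked_ids[:n_top])
    rest = list(ranked_ids[n_top:])
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_random, len(rest)))
    return top + sampled

def is_near_native(irmsd, cutoff=2.5):
    """Label a decoy as a near-native hit when its iRMSD (in Å) is below the cutoff."""
    return irmsd < cutoff

# Hypothetical identifiers 1..54000 standing in for the ranked ZDOCK results.
decoys = select_decoys(list(range(1, 54001)))
print(len(decoys))
```

Sampling without replacement from the remainder guarantees that no decoy is selected twice, so the final set always contains 1,000 distinct structures when enough results are available.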
Refinements of near‐native and non‐native decoy structures were performed with MODELLER,62 which uses comparative modeling to predict 3D models of proteins using the amino acid sequence and a known template structure. This procedure helped to resolve any clashes and small conformational changes resulting from induced fit. Once the interface models had been generated, the KFC2a model43 was used to predict the hot spots for the refined decoy structures. Note that in a few cases, hot spots could not be computed for some decoys due to errors produced by NACCESS,63 which is used by KFC2a to calculate accessible surface area; analysis on these examples includes all decoys for which hot spot data are available, with other structures effectively classified as non‐native. Likewise, structures that did not generate any predicted hot spots were necessarily classified as non‐native. We applied several strategies for assessing evolutionary conservation of predicted hot spot residues. For each predicted hot spot, sequence conservation at that position was calculated using the multiple sequence alignments for interologs, with a conserved hot spot recorded each time the amino acid in an ortholog matched at a predicted hot spot. We also examined conservation at adjacent positions, using windows of 3, 5, 7, and 9 residues centered at the predicted hot spot. In addition, we considered whether performance was improved by filtering to ensure a minimum number of hot spots in each binding partner. We report results for a “windowed” strategy with window size 3 and a “filtered” strategy requiring at least 6 hot spots per binding partner. Finally, these calculations were repeated using a more lenient definition of conservation based on chemical type according to the following classifications: small (GA), hydrophobic (VILMP), aromatic (FYW), nucleophilic (STC), amide (NQ), acidic (DE), and basic (HKR).
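
One way to picture the sequence‐based strategies is the sketch below. The chemical classes are those listed above; everything else is an assumption made for illustration, in particular the function names, the treatment of gaps as non‐conserved, the averaging over all positions in a window, and the choice to score a partner as zero when it fails the hot spot filter. The text does not specify these details.

```python
CHEMICAL_CLASSES = {
    "small": "GA", "hydrophobic": "VILMP", "aromatic": "FYW",
    "nucleophilic": "STC", "amide": "NQ", "acidic": "DE", "basic": "HKR",
}
RESIDUE_CLASS = {aa: name for name, aas in CHEMICAL_CLASSES.items() for aa in aas}

def matches(a, b, chemical=False):
    """True if residues a and b are conserved under the chosen definition."""
    if a == "-" or b == "-":  # assumption: alignment gaps never count as conserved
        return False
    if chemical:
        class_a = RESIDUE_CLASS.get(a)
        return class_a is not None and class_a == RESIDUE_CLASS.get(b)
    return a == b

def conservation_fraction(target, orthologs, hot_spots,
                          window=1, chemical=False, min_hot_spots=0):
    """Fraction of conserved residues at (or around) predicted hot spots.

    target and orthologs are rows of one alignment (equal length);
    hot_spots are 0-based alignment columns of predicted hot spots.
    """
    if len(hot_spots) < min_hot_spots:
        return 0.0  # assumption: a filtered-out partner contributes no score
    if not hot_spots or not orthologs:
        return 0.0
    half = window // 2
    hits = total = 0
    for pos in hot_spots:
        for col in range(max(0, pos - half), min(len(target), pos + half + 1)):
            for seq in orthologs:
                total += 1
                hits += matches(target[col], seq[col], chemical)
    return hits / total

# Toy alignment (hypothetical sequences), hot spots at columns 1 and 2.
target = "AVKDE"
orthologs = ["AVKDE", "AIRDE", "GLKEE"]
print(conservation_fraction(target, orthologs, [1, 2]))                 # absolute
print(conservation_fraction(target, orthologs, [1, 2], chemical=True))  # chemical
```

In the toy alignment, V/I/L and K/R differ in identity but share a chemical class, which is why the chemical definition is more lenient than the absolute one.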
In addition to the decoy structures, we examined the native structure (remodeled using the same comparative modeling procedure as for the decoys) in order to compare its score with near‐native and non‐native examples. For the smaller Data Set 1, we studied structure‐based models of interolog interfaces on a subset of the 1,000 decoys. Each decoy structure was used as a template for comparative modeling of the interolog interfaces. After collecting hot spot data on the template and each of the orthologous models, we evaluated the conservation of these hot spots between the template and the interolog interfaces. Hot spots were considered conserved if they were both spatially and chemically conserved. This meant that they needed to be within 3 Å of each other in three‐dimensional space, and also that they must be of the same chemical type using the classifications given above. For example, if a valine residue on the template was within 3 Å of a methionine residue of an ortholog, and both were predicted as hot spots, the hot spot was considered conserved.
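
The spatial and chemical criterion can be illustrated with a short sketch. It assumes that each predicted hot spot is reduced to a single representative coordinate (the text does not say which atoms were used to measure the 3 Å distance) and that the template and ortholog models share a coordinate frame; the function name and data layout are hypothetical.

```python
import math

CHEMICAL_CLASSES = {
    "small": "GA", "hydrophobic": "VILMP", "aromatic": "FYW",
    "nucleophilic": "STC", "amide": "NQ", "acidic": "DE", "basic": "HKR",
}
RESIDUE_CLASS = {aa: name for name, aas in CHEMICAL_CLASSES.items() for aa in aas}

def count_conserved_hot_spots(template_hs, ortholog_hs, cutoff=3.0):
    """Count template hot spots matched by an ortholog hot spot.

    Each hot spot is (one_letter_residue, (x, y, z)) with coordinates in Å.
    A match requires the same chemical class and a distance within the cutoff.
    """
    conserved = 0
    for aa_t, xyz_t in template_hs:
        for aa_o, xyz_o in ortholog_hs:
            same_class = RESIDUE_CLASS[aa_t] == RESIDUE_CLASS[aa_o]
            if same_class and math.dist(xyz_t, xyz_o) <= cutoff:
                conserved += 1
                break  # one match is enough for this template hot spot
    return conserved

# Valine near a methionine (both hydrophobic) is conserved;
# lysine near an aspartate (basic vs. acidic) is not.
template = [("V", (0.0, 0.0, 0.0)), ("K", (10.0, 0.0, 0.0))]
ortholog = [("M", (1.0, 1.0, 1.0)), ("D", (10.5, 0.0, 0.0))]
print(count_conserved_hot_spots(template, ortholog))
```

The first pair mirrors the valine and methionine example given in the text.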

RESULTS

To rescore examples using sequence conservation, we averaged the fraction of conserved hot spots obtained for each binding partner. As previously described, this was done using several strategies, including filtering to require at least six predicted hot spots in each binding partner, and windowed conservation around the predicted hot spots. Supporting Information Tables S3 and S4 show detailed results for eight strategies: absolute conservation unfiltered/unwindowed (A), filtered (AF), windowed (AW), and filtered/windowed (AFW), plus each of these four strategies applied using chemical conservation (C, CF, CW, CFW). Systems for which near‐native hits were present among the decoys will be referred to as viable. Note that all systems become viable with the inclusion of the native structure among the decoys. In Figure 1, one can observe some of the differences among scoring strategies. For benchmark example 1XQS, the filtered strategies eliminate some non‐native structures that scored highly; these examples likely had a hot spot or two in each partner that happened to be conserved by chance. The windowed strategies perform somewhat worse on this example, with overall conservation dropping in both partners. The chemical strategies do not make much difference overall, but the rank of the native structure is improved relative to near‐native hits.
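
The rescoring and hit counting can be sketched as below. This is only an illustration under stated assumptions: the function names and the (score, iRMSD) data layout are hypothetical, and ties in score are broken by input order, which the text does not address. The 2.5 Å cutoff is the near‐native threshold given in the Methods.

```python
def decoy_score(fraction_a, fraction_b):
    """Average the conserved hot spot fractions of the two binding partners."""
    return (fraction_a + fraction_b) / 2.0

def has_top_n_hit(decoys, n, cutoff=2.5):
    """True if a near-native decoy is among the n highest-scoring decoys.

    decoys is a list of (score, irmsd) pairs; sorted() is stable, so decoys
    with equal scores keep their input order (an assumption of this sketch).
    """
    ranked = sorted(decoys, key=lambda d: d[0], reverse=True)
    return any(irmsd < cutoff for _, irmsd in ranked[:n])

# Toy system (hypothetical values): the best-scoring decoy is non-native,
# the second best is near-native.
system = [(0.90, 8.0), (0.80, 1.9), (0.30, 12.0)]
print(has_top_n_hit(system, 1), has_top_n_hit(system, 10))
```

A system is viable only if at least one of its decoys passes the iRMSD cutoff, so a non‐viable system can never return a hit under this definition.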

Figure 1

The benchmark example 1XQS highlights some of the possible scoring variations produced by the eight sequence‐based scoring strategies. The filtered strategies remove some of the high‐scoring non‐native predictions. The benefit of windowed and chemical conservation varies among examples; in this case, the windowed strategies performed somewhat worse overall than the unwindowed strategies. The chemical strategies performed slightly better than the strategies based on absolute sequence conservation, particularly on the native structure.

Table 1 gives the number of systems generating top 10 and top 1 hits for the various scoring strategies and data sets, and results at the system‐specific level are detailed in Supporting Information Tables S3–S8. The CFW strategy produced the most systems with hits in the top 10 in both data sets, with 14/17 (82%) examples with a top 10 hit for Data Set 1 and 25/55 (45%) examples with a top 10 hit for Data Set 2. Other strategies saw a range of 53–76% success on Data Set 1 and 42–45% success on Data Set 2. Upon adding the native structure to the collection of decoys, the CFW strategy generated top 10 hits for 17/22 (77%) of examples in Data Set 1 and 31/85 (36%) of examples in Data Set 2. Other methods were successful in 50–77% of examples in Data Set 1 and 34–39% of examples in Data Set 2 when including the native structure.

Table 1

For Each of the Eight Strategies (A, AF, … CW, CFW), the Number of Systems Returning Top 10 and Top 1 Hits is Given, Along with the Number of Viable Systems for Which Hits Were Present

	Viable	A	AF	AW	AFW	C	CF	CW	CFW	ANY
Top 10 Results
Data Set 1	17	10	13	9	11	10	13	12	14	15
Data Set 1 + Native	22	13	17	11	14	13	15	15	17	19
Data Set 2	55	24	25	23	23	23	25	23	25	37
Data Set 2 + Native	85	32	33	27	28	29	33	32	31	52
Top 1 Results
Data Set 1	17	2	4	5	6	2	5	6	6	12
Data Set 1 + Native	22	5	5	8	7	6	5	12	10	18
Data Set 2	55	8	13	8	12	9	8	9	12	23
Data Set 2 + Native	85	12	17	12	13	12	13	17	15	36

By adding the native structure, hits are present in all systems, and the analysis is repeated using this data.
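
The success rates quoted in the text follow from the counts in Table 1 by dividing the number of systems with a hit by the number of viable systems. The short check below uses the CFW column only; the counts are copied from the table, while the function and variable names are illustrative.

```python
# (systems with a hit, viable systems) from the CFW column of Table 1.
TOP10_CFW = {
    "Data Set 1": (14, 17), "Data Set 1 + Native": (17, 22),
    "Data Set 2": (25, 55), "Data Set 2 + Native": (31, 85),
}
TOP1_CFW = {
    "Data Set 1": (6, 17), "Data Set 1 + Native": (10, 22),
    "Data Set 2": (12, 55), "Data Set 2 + Native": (15, 85),
}

def success_rate(hits, viable):
    """Percentage of viable systems with a hit, rounded to the nearest integer."""
    return round(100.0 * hits / viable)

for name, table in (("top 10", TOP10_CFW), ("top 1", TOP1_CFW)):
    for label, (hits, viable) in table.items():
        print(f"{name}, {label}: {hits}/{viable} = {success_rate(hits, viable)}%")
```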

For generating top 1 hits, the CFW strategy was successful on 6/17 (35%) of examples from Data Set 1 and 12/55 (22%) of examples from Data Set 2. Other strategies ranged in success from 12–35% on Data Set 1 and from 15–24% on Data Set 2. Upon addition of the native structure as a decoy, CFW was successful in 10/22 (45%) of cases for Data Set 1 and in 15/85 (18%) of cases for Data Set 2. Other strategies were successful in 23–55% of examples in Data Set 1 and 14–20% of examples in Data Set 2 when including the native structure. To better understand the difference in success rates between Data Sets 1 and 2, we analyzed the subset of examples common to both data sets. The decoys used in each case were the same, but the selection of orthologs was different. The orthologs used in Data Set 1 were checked more carefully to ensure similar function, while the orthologs in Data Set 2 are more extensive but also include more speculative data such as “putative” protein sequences. Table 2 summarizes the comparison, which shows that the Data Set 1 alignments produced top 10 and top 1 hits at a significantly higher rate than alignments in Data Set 2, regardless of the conservation strategy used. Figures showing the relationship between iRMSD and hot spot conservation for these examples can be found in Supporting Information Figure S1.

Table 2

	Viable	A	AF	AW	AFW	C	CF	CW	CFW	ANY
Top 10 Results
Data Set 1	16	10	12	8	10	10	12	11	13	14
Data Set 1 + Native	20	13	15	10	13	12	13	13	15	17
Data Set 2	16	6	7	7	7	5	8	7	8	11
Data Set 2 + Native	20	8	9	7	9	6	10	8	9	13
Top 1 Results
Data Set 1	16	2	4	4	5	2	5	5	5	11
Data Set 1 + Native	20	5	5	7	6	6	5	10	8	16
Data Set 2	16	1	3	1	3	1	1	1	4	7
Data Set 2 + Native	20	2	5	2	3	2	2	3	5	10

By adding the native structure, hits are present in all systems, and the analysis is repeated using this data.

Using Only the 20 Examples Common to Data Sets 1 and 2, the Number of Systems Returning Top 10 and Top 1 Hits Is Given, Along with the Number of Viable Systems for Which Hits Were Present

In addition to sequence‐based analysis, we constructed homology models of interolog interfaces using a subset of decoys and sequences from Data Set 1. The structure‐based conservation strategy is most similar to the C or CW sequence‐based strategies, since we used the chemical definition of conservation and allowed spatial alignment of predicted hot spots. The general trends seemed to match those of the sequence‐based approaches, with roughly the same set of systems generating high‐scoring hits. One interesting difference was that the range of scores among near‐native hits was often narrower for the structure‐based analysis (cf. 1AY7, 1B6C, 1E6E, 1GLA, 1Z0K). In these cases, it is likely that decoy structures generated non‐native hot spot predictions that were also predicted in the modeled interolog structures but did not reflect the true hot spots of the complex. In this respect, the structure‐based modeling may offer some improvement over the simpler sequence‐based approach. Figures showing the relationship between iRMSD and (total) hot spot conservation for these examples can be found in Supporting Information Figure S2.

DISCUSSION

Our results have been structured to facilitate easy comparison with the work of Andreani et al. on the InterEvol Score.54 They assessed their method by rescoring ZDOCK decoys, ranking them both individually and within clusters. For Data Set 2, they were able to achieve a top 10 hit in 14 cases; following a clustering procedure, this improved to 24 cases. Their results were shown to be an improvement over several existing interface scoring methods, including ZDOCK 3.0,64 ZRANK,65 and SPIDER.66 We examined a smaller number of decoys and did not apply clustering, since some viable systems return very few hits. Without clustering, our approach was successful in 25 cases using the CFW strategy when using the same sequence alignments. An extensive recent survey67 suggests that the best current methods are able to achieve top 10 hits in <60% of cases and a top 1 hit in <30% of cases. While our study is smaller due to data limitations, we achieved a top 10 hit in 14/17 (82%) of viable cases for Data Set 1 using the CFW strategy. A top 1 hit was obtained in 6/17 (35%), and when the native structure was added, this number jumped to 10/22 (45%). A total of 18/22 (82%) structures returned a top 1 hit according to one of the strategies when the native structure was included. For Data Set 2, top 10 hits were obtained in 25/55 (45%) and top 1 hits for 12/55 (22%) of viable systems using the CFW strategy. Upon addition of the native structures, the fraction of systems returning top 10 hits was 31/85 (36%), and the fraction returning top 1 hits was 15/85 (18%) when using the CFW strategy. The results presented in Table 2 suggest that many of the unsuccessful systems from Data Set 2 may be successful using our approach if given an optimal collection of interologs.
It is yet unclear how to define the “sweet spot” in choosing orthologs, since high levels of sequence homology offer little information, while divergent examples may not display sufficient levels of conservation at predicted hot spots. In addition to success rates, it is interesting to note how scoring of the (remodeled) native structures compares with scoring of near‐native and non‐native decoys. In many cases, the native structure scored at or near the top, while in other cases it scored significantly lower than near‐native hits. In cases where performance on native structures was notably worse than for near‐natives, the structures generated an abundance of predicted hot spots. This suggests that false positive predictions may be the reason for this difference. In addition, many near‐natives did somewhat better than native structures when using windowed strategies. In these cases, it seems likely that some hot spot predictions were made at true hot spots in the native structure while near‐native structures generated hot spot predictions at neighboring positions.

CONCLUSION

Through the study of interolog interactions, we observe a strong correlation between low iRMSD and hot spot conservation across a range of systems. Our analysis suggests that, with appropriate choice of interologs and good sampling of near‐native structures, our hot spot‐based approach can return top 10 and top 1 hits at a comparable or higher rate than the best existing methods. While these initial results are very promising, it is clear that further improvements are possible through better selection of interolog sequences and refinements to the KFC2a model able to reduce false positive predictions. The windowed and filtered strategies can also be refined, such as by taking the maximum conservation at any residue within the window or by setting the filtering threshold according to the size of the protein–protein interface. Future work will also consider how to select the “best” near‐native decoys, such as those with the highest fraction of near‐native contacts, since correct prediction of hot spots is likely to correlate with correct prediction of native residue‐residue interactions. As high‐quality resources such as InterEvol60 and 3D Interologs59 develop alongside our evolutionary‐driven scoring models, there is an opportunity to learn more about molecular evolution at the atomic scale.

66 in total

1. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations.

Authors: Raphael Guerois; Jens Erik Nielsen; Luis Serrano
Journal: J Mol Biol Date: 2002-07-05 Impact factor: 5.469

2. Computational mapping of anchoring spots on protein surfaces.

Authors: Avraham Ben-Shimon; Miriam Eisenstein
Journal: J Mol Biol Date: 2010-07-17 Impact factor: 5.469

3. Identification of interacting hot spots in the beta3 integrin stalk using comprehensive interface design.

Authors: Jason E Donald; Hua Zhu; Rustem I Litvinov; William F DeGrado; Joel S Bennett
Journal: J Biol Chem Date: 2010-10-07 Impact factor: 5.157

4. DrugScorePPI webserver: fast and accurate in silico alanine scanning for scoring protein-protein interactions.

Authors: Dennis M Krüger; Holger Gohlke
Journal: Nucleic Acids Res Date: 2010-05-28 Impact factor: 16.971

5. RosettaBackrub--a web server for flexible backbone protein structure modeling and design.

Authors: Florian Lauck; Colin A Smith; Gregory F Friedland; Elisabeth L Humphris; Tanja Kortemme
Journal: Nucleic Acids Res Date: 2010-05-12 Impact factor: 16.971

6. 3D-interologs: an evolution database of physical protein-protein interactions across multiple genomes.

Authors: Yu-Shu Lo; Yung-Chiang Chen; Jinn-Moon Yang
Journal: BMC Genomics Date: 2010-12-01 Impact factor: 3.969

7. Presaging critical residues in protein interfaces-web server (PCRPi-W): a web server to chart hot spots in protein interfaces.

Authors: Joan Segura Mora; Salam A Assi; Narcis Fernandez-Fuentes
Journal: PLoS One Date: 2010-08-23 Impact factor: 3.240