Efficient consideration of coordinated water molecules improves computational protein-protein and protein-ligand docking discrimination.

Ryan E Pavlovicz^1,2, Hahnbeom Park^1,2, Frank DiMaio^1,2.

Abstract

Highly coordinated water molecules are frequently an integral part of protein-protein and protein-ligand interfaces. We introduce an updated energy model that efficiently captures the energetic effects of these ordered water molecules on the surfaces of proteins. A two-stage method is developed in which polar groups arranged in geometries suitable for water placement are first identified, then a modified Monte Carlo simulation allows highly coordinated waters to be placed on the surface of a protein while simultaneously sampling amino acid side chain orientations. This "semi-explicit" water model is implemented in Rosetta and is suitable for both structure prediction and protein design. We show that our new approach and energy model yield significant improvements in native structure recovery of protein-protein and protein-ligand docking discrimination tests.

Year: 2020 PMID: 32956350 PMCID: PMC7529342 DOI: 10.1371/journal.pcbi.1008103

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

This is a PLOS Computational Biology Methods paper.

Introduction

Water plays a significant role in biomolecular structure. The hydrophobic effect drives the collapse of proteins into their general shape, while highly coordinated water molecules (water molecules making multiple water-protein hydrogen bonds) on the surface of a protein may confer specific conformations to nearby polar groups. Furthermore, water plays a key role in biomolecular recognition: when a ligand binds its host in an aqueous environment, it must displace water molecules on the surface and energetically compensate for the lost interactions[1]. Coordinated water molecules may also drive host-ligand recognition by bridging interactions between polar groups on each side of the complex. Simulations of proteins in explicit solvent have been successful in predicting folded conformations[2] as well as computing binding free energies[3] with high accuracy. However, explicit solvent calculations are computationally expensive, particularly in Monte Carlo simulations where a long water equilibration period might be required. Such a cost may be alleviated through the use of an implicit solvent model[4], which, while more efficient, incurs a loss of accuracy by disregarding the energetics of highly coordinated water molecules[5]. Thus, an approach combining the efficiency of implicit solvation with the ability to recapitulate well-coordinated water molecules is desired. Several such methods exist, but they tend to be developed for specific types of interactions (e.g., protein-protein or protein-small molecule ligand)[6-11] or are computationally expensive[12]. In this paper, we describe the development of general methods for capturing the energetic effects of explicit solvent, but with the computational efficiency of an implicit solvent model. Our intent is for this energy model to better discriminate the correct binding modes of protein-protein and protein-ligand complexes. These new methods include (1) a new energy function that implicitly captures the energetics of protein and coordinated-water interactions and (2) a conformational sampling approach that efficiently samples protein and explicit water conformations simultaneously. We show that these methods enable us to predict water positions accurately and improve our ability to discriminate native protein-protein and protein-ligand interfaces from non-native decoy conformations.

Results

Our approach for modeling coordinated water molecules using Rosetta, fully described in Methods, is briefly presented here. We have developed two complementary approaches for capturing coordinated-water energetics. We hypothesize that more accurately modeled interface waters will lead to better discrimination of correct binding modes from incorrect (decoy) binding modes. First, Rosetta-ICO (Implicit Consideration of cOordinated water) implicitly captures pairs of polar groups arranged such that a theoretical “bridging” water molecule may form favorable hydrogen bonds to stabilize the interaction. This calculation is efficient but ignores multi-body interactions that may favor, for example, waters coordinated by >2 hydrogen bond donors or acceptors. While this implicit water model is more accurate than our prior model, which did not consider these water molecules at all, modeling a subset of waters explicitly should further improve model accuracy. Therefore, we have also developed Rosetta-ECO (Explicit Consideration of cOordinated water), in which Rosetta’s Monte Carlo (MC) simulation is augmented with moves to add or remove explicit solvent molecules from bulk. By sampling water orientations at sites where predicted bridging waters overlap (Fig 1E), we properly coordinate water molecules to optimize hydrogen bonding.

Fig 1

Implicit and explicit treatment of water in Rosetta.

Implicit water score function potentials, panels A-D. Potential plots were generated by orienting the N-H and C=O groups of two ALA residues along the same axis with an H···O distance of 1.3 Å (origin). The donor residue is then shifted ±7 Å to generate a planar cut of the solvation potentials between the N and O atoms. All plots have units of kcal/mol[13, 14]. (A) fa_sol term: isotropic desolvation penalty implemented in Rosetta using the Lazaridis-Karplus model. (B) lk_ball term: anisotropic correction for polar atom types, first introduced into the REF2015 score function. (C) lk_bridge term: anisotropic solvation reward introduced into the Rosetta-ICO score function. (D) Composite of panels A-C, using the finalized Rosetta-ICO score term weights. Explicit water placement with Rosetta-ECO, panels E-H. (E) Initial possible solvation sites (blue) are based on statistics of water positions around backbone polar atoms in addition to sites around side chain polar atoms considering all possible non-clashing rotamers. Pictured is the interface of PDB ID 1P57, between the N-terminal (pink) and catalytic (teal) domains of hepsin, with crystallographic waters in transparent grey. (F) After an initial stage of Monte Carlo packing of both the possible water sites and surrounding protein side chains, a cutoff is applied based on the water occupancy of each site over the simulation (blue = 0% occupancy, green = 25%, red = 50%). (G) Remaining water sites are clustered, and a second cumulative dwell time cutoff is applied. (H) The final predicted water sites are converted into three-atom water molecules and the orientation is reoptimized together with nearby side chain conformations using the Rosetta all-atom energy function.

For both approaches, the Rosetta energy function has been reoptimized using the dualOptE framework described by Park et al.[14]. In this optimization, several meta-parameters describing the shape of the Rosetta-ICO potential, several terms controlling the strength and shape of protein-water interactions, and ~50 other per-atom polar parameters were optimized to allow for compensating changes to the new energy terms. Energy function parameters for polar groups, including partial atomic charges, were refit using the same training tasks originally used in the parameterization of the opt-nov15 energy function[14], now called REF2015[13]. While all parameters were optimized for Rosetta-ICO (see S7–S10 Tables in S1 Text for final values), only a subset of water-specific parameters was refit when developing the explicit water terms for Rosetta-ECO. The results in this section are shown with the updated energy functions compared to baseline tests run using the REF2015 energy function[14].

Rotamer and water recovery at protein-protein interfaces

A set of 123 native protein-protein interfaces from high-resolution X-ray crystal structures was used to test how well the new energy models perform at simultaneously predicting amino acid side chain conformations and coordinated water sites (data set details may be found in S2 Text). Tests involved re-sampling the side chain conformations of interface residues on a fixed backbone in MC simulations and evaluating the resulting predicted side chains against the deposited density maps. In tests involving semi-explicit water molecules (Rosetta-ECO), we simultaneously sample protein side chain conformations and water placements. A baseline rotamer recovery error of 9.73 ± 0.13% was obtained using the REF2015 energy function for the 7040 flexible side chains of the test set. A marginal improvement is made with Rosetta-ICO, reducing error to 9.52 ± 0.04%. Inclusion of explicit water molecules in this test fails to further decrease the overall rotamer recovery error beyond the improvements observed with Rosetta-ICO, with a Rosetta-ECO error of 9.59 ± 0.15%, while predicting ~19 explicit water molecules per protein-protein interface. For reference, side chain packing tests that use “native” water molecules (the result expected from perfect water recall and precision) achieve a rotamer recovery error of 8.36 ± 0.04%, while random perturbation of these waters suggests a placement tolerance of less than 0.8 Å (S17 Fig). In addition to measuring side chain rotamer recovery at the protein-protein interfaces, we also analyzed the recovery of water positions found in the high-resolution X-ray crystal structures when implementing the Rosetta-ECO solvation method. For water recovery tests, modeled water positions are considered “correct” if they are placed within 0.5 Å of the native water or if they are coordinated by the same polar atoms. Using these strict criteria, Rosetta-ECO is able to recover 17.7% of native water molecules with a precision of 17.7%. Details of Rosetta-ECO water recovery are shown in Table 1. These data show that our approach is most effective at predicting “buried” waters (28.3% recovery) and highly coordinated waters (31.8% recovery of triply coordinated waters). Unsurprisingly, Rosetta-ECO is also much more effective at predicting backbone-coordinated waters, correctly predicting 50.0% of backbone-only coordinated waters. An example of two correctly predicted water sites is illustrated in Fig 1H.
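The recovery and precision figures above can be illustrated with a simple distance test. The following is a minimal sketch, assuming only the 0.5 Å distance criterion (the shared-coordination criterion is omitted); the function name and example coordinates are hypothetical and this is not the Rosetta-ECO implementation.

```python
import math

def water_recovery(predicted, native, cutoff=0.5):
    """Return (recall, precision) for predicted water oxygen positions.

    predicted, native: lists of (x, y, z) tuples in angstroms.
    A native water counts as recovered if any predicted water lies within
    `cutoff`; a predicted water counts as correct if any native water does.
    """
    def near(point, others):
        return any(math.dist(point, other) <= cutoff for other in others)

    recovered = sum(near(n, predicted) for n in native)
    correct = sum(near(p, native) for p in predicted)
    recall = recovered / len(native) if native else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision

# Illustrative coordinates only (not taken from any structure).
native = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
predicted = [(0.3, 0.0, 0.0), (9.0, 9.0, 9.0), (5.0, 0.4, 0.0), (2.0, 2.0, 2.0)]
recall, precision = water_recovery(predicted, native)
```

In this toy case both native waters are recovered, but only two of the four predictions are correct, which shows why recall and precision are reported separately.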

Table 1

Classification of predicted native waters (test set of 123).

		Rosetta-ECO
Type¹	Subset Size	% recovered²	% precision³
All	2815	17.7 (0.08)	17.7 (0.08)
Exposed	630	6.0 (0.13)	4.7 (0.1)
Partially Buried	1803	19.5 (0.39)	21.7 (0.5)
Buried	382	28.3 (1.19)	27.5 (1.3)
1 protein coord	770	6.3 (0.12)	5.0 (0.2)
2 protein coord	1077	27.2 (0.24)	25.3 (0.3)
3 protein coord	399	31.8 (0.43)	26.2 (0.4)
BB only	330	50.0 (1.24)	23.1 (0.4)
SC only	333	7.8 (0.65)	18.1 (1.1)
BB+SC	440	27.6 (0.18)	26.6 (0.3)

²,³ Percent of specific types of waters recovered using recovery criteria described in Methods, averaged over three runs with standard deviations in parentheses.

¹ Three categorizations of predicted water molecules. First, waters are classified by ‘buriedness’ based on the number of amino acid neighbors (nCβ) with Cβ within 10 Å. Exposed: nCβ ≤ 15; partially buried: 15 < nCβ ≤ 25; buried: nCβ > 25. Second, classification by 1, 2, or 3 protein coordination partners within 3.2 Å. Finally, by type of coordinating protein atoms within 3.2 Å of the water O atom: at least two backbone only (BB only), side chain only (SC only), or a mix of backbone and side chain coordination (BB+SC).

Rosetta-ECO predicts an average of 9.1 waters per 1000 Å², compared to an observed average of 9.3 waters per 1000 Å² of interface surface area, which is in line with previous analyses of interface solvation[15]. Thus, the ECO protocol hydrates protein interfaces to a degree similar to that observed in crystallographic structures. Finally, the results of Rosetta-ECO were compared against solvent placement using the 3D-RISM methodology as implemented in AmberTools19[16]. 3D-RISM, like most other water site prediction methods, operates on a fixed protein model (in this case, the crystallographic structures). In our tests, 3D-RISM recovered 22.9% of the full interface water data set, about 5 percentage points more than ECO when calibrated to the same level of precision (see S2 Table for detailed results). Rosetta-ECO, which predicts water positions in addition to protein side chain conformations, performs particularly strongly at recovering waters that are exclusively coordinated by backbone groups (Table 1), outperforming 3D-RISM by 35% for this classification of water. Overall, the 3D-RISM calculations take ~20-fold longer to run (S3 Table).
While other computational water prediction methods exist that are faster or more accurate than Rosetta-ECO, to our knowledge, they are all benchmarked against static protein structures, making direct comparison to Rosetta-ECO inappropriate. For example, WaterDock[9], which uses AutoDock Vina to predict water positions using a grid-based docking approach, was developed for computational drug design purposes, with a focus on small molecule binding sites as opposed to large protein-protein interfaces. Using a recovery cutoff of 1.4 Å, WaterDock reports a recovery of 87% of crystallographic waters in a set of 14 OppA crystal structures bound to different KXK tripeptides, with runtime on the order of seconds. WaterMap[11], on the other hand, relies on 2 ns MD simulations on a fixed protein. This leads to significantly longer run times (on the order of hours), but can yield highly accurate results: in a study using a dataset of 41 crystallographic water sites at nine bromodomain/ligand complex interfaces, WaterMap accurately predicted more than 70% of the experimental water positions within 0.5 Å[17]. 3D-RISM, which was also benchmarked in this study, recovered slightly more than 30% using the same recovery cutoff. Finally, we also applied Rosetta-ECO to CAPRI Target 47[18], a homology modeling challenge of a protein/protein interface including the blind prediction of water molecules at the modeled interface. Our results, described in detail in S3 Text, place our best modeling effort within range of the top-scoring submissions to the modeling challenge. One of our models places 13 water molecules at the modeled protein/protein interface, 11 of which come within 2.0 Å of one of the 22 crystallographic interface water molecules, making for a true positive prediction rate of 50% while only placing two additional water molecules not observed in the crystal structure.
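The buriedness classes defined in the Table 1 footnote reduce to a neighbor count followed by two thresholds. The sketch below illustrates that rule; the function names and the plain coordinate-list input are assumptions made for illustration, not part of the published protocol.

```python
import math

def count_cbeta_neighbors(water_xyz, cbeta_coords, radius=10.0):
    """Count Cβ atoms within `radius` angstroms of a water oxygen (nCβ)."""
    return sum(math.dist(water_xyz, cb) <= radius for cb in cbeta_coords)

def classify_buriedness(n_cbeta):
    """Apply the Table 1 thresholds: <=15 exposed, 16-25 partial, >25 buried."""
    if n_cbeta <= 15:
        return "exposed"
    if n_cbeta <= 25:
        return "partially buried"
    return "buried"

# Illustrative input: 20 Cβ atoms spaced 0.4 Å apart along x, plus one far away.
cbetas = [(0.4 * i, 0.0, 0.0) for i in range(20)] + [(50.0, 0.0, 0.0)]
n = count_cbeta_neighbors((0.0, 0.0, 0.0), cbetas)
label = classify_buriedness(n)
```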

Native interface recapitulation

We next tested the ability of the new energy model to recapitulate near-native conformations of protein-protein interfaces (PPIs) and protein-ligand interfaces. In these tests, which were not used at all in parameter training, the binding free energies for a number of near-native and incorrect (decoy) docking conformations of each complex are computed with the aim of discriminating the correct binding poses from the decoys. PPI decoys were sampled using a combination of Zdock[19] and RosettaDock[20], while protein-ligand decoys were generated using RosettaLigand[21]. Both datasets were enriched for water-rich interfaces, leading to 53 protein-protein and 46 protein-ligand interface datasets. Predicted interface energies, ΔGbind, were calculated for all decoys as described in Methods. We assess the ability to predict the near-native conformations using a “discrimination score”[14], which computes the Boltzmann weight of near-native structures. The values range from 0 to 1, with higher values showing better discrimination. We also assess with a noisier (but more interpretable) “percent correct” metric, which identifies the number of cases in which near-native bound conformations have the lowest energy. An overview of the results is shown in Table 2, while results for all cases are presented in S2 Fig through S10 Fig. Select cases in which the inclusion of predicted explicit water molecules improved native discrimination are detailed below.
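The two metrics can be sketched as follows. This is a simplified illustration, assuming a single temperature factor `kT` and a single RMSD cutoff (both hypothetical parameters here); the discrimination score defined in ref. [14] may differ in detail, for example by combining several temperatures or cutoffs.

```python
import math

def discrimination_score(energies, rmsds, rmsd_cutoff=2.0, kT=1.0):
    """Boltzmann weight carried by near-native decoys, between 0 and 1."""
    e_min = min(energies)  # shift energies so the exponentials stay bounded
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    near_native = sum(w for w, r in zip(weights, rmsds) if r <= rmsd_cutoff)
    return near_native / sum(weights)

def lowest_energy_is_near_native(energies, rmsds, rmsd_cutoff=2.0):
    """'Percent correct' contribution of one case: is the best decoy near-native?"""
    best = min(range(len(energies)), key=lambda i: energies[i])
    return rmsds[best] <= rmsd_cutoff

# Illustrative values: one near-native pose that is 5 energy units better
# than two distant decoys, then the reverse situation.
good = discrimination_score([-10.0, -5.0, -5.0], [0.5, 8.0, 9.0])
bad = discrimination_score([-5.0, -10.0], [0.5, 8.0])
```

A funnel-shaped energy landscape gives a score near 1, while a landscape that favors a distant decoy gives a score near 0.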

Table 2

Performance of solvation schemes on protein-protein and protein-small molecule docking discrimination.

	REF2015	Rosetta-ICO¹	Rosetta-ECO²
Protein-small molecule
discrimination score³	0.749 ± 0.003	0.807 ± 0.002	0.873 ± 0.003
percent correct⁴	77.1 ± 2.1	77.8 ± 1.8	94.1 ± 1.1
normalized run time⁵	1.00	1.09	1.52
Protein-protein
discrimination score	0.628 ± 0.014	0.739 ± 0.006	0.794 ± 0.004
percent correct	63.6 ± 0.9	74.9 ± 0.9	79.9 ± 2.3
normalized run time	1.00	1.25	2.59

¹ Implicit consideration of coordinated water molecules

² Inclusion of well-ordered explicit water molecules

³ Reported are the average Boltzmann-weighted discrimination scores ± 1σ averaged over three independent runs for 46 protein-ligand and 53 protein-protein docking cases

⁴ The percentage of cases in which the lowest scoring model is within 1.0 Å of the native conformation for protein-ligand docking and 2.0 Å for protein-protein docking, averaged over 3 independent runs

⁵ Run time, normalized to baseline, is the sum of individual run times to calculate ΔGbind for each near-native and decoy conformation

Protein-protein docking discrimination

In protein-protein docking discrimination tests with binding modes that broadly sample conformational space[14], significant improvements are observed when comparing Rosetta-ICO to the baseline results, with the discrimination score increasing from 0.63 to 0.74. Rosetta-ECO further improves this discrimination score to 0.79. We also consider the “success rate,” the fraction of cases in which the lowest-energy conformation is within 2.0 Å of native: the ECO model enables successful prediction of a near-native conformation in 8 additional cases out of the set of 53, a ~15% improvement. This comes at a modest increase in computational cost, with average run times 1.25 and 2.59 times the baseline for ICO and ECO, respectively. As illustrated in Fig 2A, Rosetta-ECO improves the discrimination score for 38 of 53 cases, adding 13.4 water molecules to the average bound state and 15.0 water molecules to the average unbound state. These average improvements remain statistically significant. Looking at one such case (adrenodoxin reductase/adrenodoxin, PDB ID 1E6E), we see that while all three energy models correctly predict a near-native conformation, the “energy gap” between native and non-native conformations is improved under Rosetta-ECO (Fig 2B). Closer investigation of the near-native models shows 21 explicit water molecules added to the binding interface. The combined electrostatic and hydrogen bond energy contributions compose a large proportion of the improved binding energy, 5.2 kcal/mol more favorable than Rosetta-ICO for this particular binding configuration.

Fig 2

Protein-protein docking results.

(A) Scatter plot comparing results of 53 cases between REF2015 and Rosetta-ECO. Values are the average Boltzmann-weighted discrimination score ± 1σ from three independent runs. (B) Energy funnels for PDB ID 1E6E, adrenodoxin reductase bound to adrenodoxin (red data point in 2A), plotting computed ΔGbind vs. RMSD from the native binding conformation for three different scoring methods. Discrimination scores for each distribution are noted in bottom right of each plot. (C) Explicitly solvated near-native docking pose (RMSD = 0.14 Å; pink data point in 2B) with the reductase in grey and adrenodoxin in rainbow (N- to C-terminus colored blue to red). (D) Coordination of some predicted interface waters.

Protein-ligand docking discrimination

For protein-ligand docking discrimination tests, Rosetta-ICO again shows an improvement over REF2015, with average discrimination score increasing from 0.75 to 0.81. Rosetta-ECO further increases the discrimination score to 0.87. In terms of “success rate”, we see the same trend as with PPIs: Rosetta-ECO enables the correct prediction (within 1.0Å of native) in 7 additional cases out of the 46. These results indicate that both Rosetta-ICO and ECO help discriminate distant decoys from native conformations when compared to the REF2015 energy model, with the inclusion of explicit water modeling in ECO conferring the largest benefit. This also comes at only a modest increase in run time: about 10% increased time for ICO, and about 52% increased computation time for ECO. The improvements in discrimination score on a per-case basis are illustrated in Fig 3A. Here, we see that Rosetta-ECO provides a nearly across-the-board improvement in native discrimination compared to the baseline calculations. The individual energy distributions for PDB ID 1X8X (tyrosyl t-RNA synthase / tyrosine) in Fig 3B show how both REF2015 and Rosetta-ICO incorrectly favor a decoy 6.6 Å from native. Rosetta-ECO’s explicit waters dramatically alter the binding energy landscape, improving the discrimination score from 0.27 to 0.89, and energetically favoring a structure only 0.43 Å from native. The ECO model predicts two water molecules that bridge the carboxyl group of the tyrosine ligand to interactions with an arginine side chain and a backbone nitrogen group (Fig 3C). Comparing the structure to the native crystal structure (at 2 Å resolution), we find that these two waters are 0.25 Å and 0.93 Å from native water positions; a third, more exposed water—also visible in Fig 3C—comes within 1.6 Å of a native water. 
Additional examples comparing recovered waters to crystal structures (PDB IDs 1N2J and 1U4D, at 1.8 Å and 2.1 Å resolution, respectively) are shown in panels 3E-H, where all four waters lie within 1 Å of a crystallographic water position (see S11 Fig and S12 Fig for full solvated binding modes).

Fig 3

Protein-ligand docking results.

(A) Scatter plot comparing results of 46 cases between baseline (REF2015) and Rosetta-ECO. Values are the Boltzmann-weighted discrimination score ± 1σ from an average of three independent runs. (B) Energy funnels, similar to Fig 2, for PDB ID 1X8X, tyrosyl t-RNA synthase bound to tyrosine (red data point in 3A). (C) Explicitly solvated, near-native docking pose in pink (RMSD = 0.43 Å; pink data point in 3B) with native ligand in transparent blue. (D) Explicitly solvated decoy binding pose (RMSD = 6.57 Å; yellow data point in 3B). (E-H) A comparison of recovered waters (red) to high-resolution crystallographic waters (green spheres) from PDB ID 1N2J (panels E-G) and PDB ID 1U4D (panel H).

Ligand docking scoring comparison

Finally, the new energy functions were compared against the results of a state-of-the-art docking approach on a standardized dataset. A recent survey[22] of widely used small-molecule docking programs tested performance against the Astex Diverse Set[23], which includes 85 targets with ligands of pharmaceutical interest. We generated decoys for a 67-target subset (omitting cases where the ligand was additionally coordinated by an ion) using the docking software GOLD[24]. The GOLD-sampled structures were then rescored using the REF2015, ICO, and ECO energy functions of Rosetta. The results, fully presented in S1 Fig and S1 Table, show that while the Rosetta-rescored structures are more accurate than GOLD (78.2% versus 67.7% accuracy within a 1 Å RMSD cutoff; 94.6% versus 80.7% accuracy within a 2 Å RMSD cutoff), little improvement is observed between REF2015 and ICO/ECO. While these results suggest Rosetta may be a powerful tool for this dataset, the restricted conformational sampling obtained from GOLD (see S13 Fig for examples of sampling in RMSD space) does not benefit from the water model developments presented here and prevents a thorough evaluation of the energy functions. It is likely that a more evenly distributed set of docking conformations would yield results similar to the score function improvements observed in the more tightly curated protein/protein and protein/ligand data sets described above.

Discussion

We have presented two approaches for considering coordinated water molecules in the prediction of native protein-protein and protein-ligand interfaces: Rosetta-ICO, which very efficiently captures the energetics of bridging waters implicitly, and Rosetta-ECO, which allows a small set of waters to emerge from bulk, resulting in a more physically complete representation of protein surfaces and interfaces. Both methods show improvements in protein interface recapitulation tasks with different levels of efficiency/accuracy tradeoffs: Rosetta-ECO is more accurate in decoy discrimination tests but 1.5–2 times slower than Rosetta-ICO, depending on interface size. The level of native water recovery for Rosetta-ECO is ~5% lower than that of 3D-RISM at a similar precision level, yet the ECO model performs this task at ~10-fold increased speed while simultaneously predicting interface side chain configurations. While the precision and recall reported by our explicit method might seem low, this is due to several factors. First, we are using a very strict recovery tolerance (0.5 Å, compared to 1.4–2.0 Å used elsewhere[9, 18]). Second, Rosetta-ECO is performing (and was designed to perform) a fundamentally different task than other approaches: simultaneously predicting both side chain geometry and coordinating waters. Nevertheless, our native water recovery numbers are encouraging when compared to a fixed-structure approach such as 3D-RISM, where results are similar when both methods are applied to the same PPI data set using the same recovery criteria. Furthermore, while this work highlights the results of water prediction and protein interface recapitulation, we might expect the Rosetta-ICO energy function to show modest improvements at tasks related to monomeric structure prediction and protein sequence design. Indeed, that seems to be the case: when tested on independent datasets, modest improvements were observed in decoy discrimination with ICO.
All other metrics were comparable between the two energy functions, leading us to conclude that the ICO model is a reasonable general-purpose energy function. The improvement in both the protein and ligand docking tests suggests that these new energy functions may prove useful in the design of novel proteins intended to bind a particular ligand or protein. Successful design of protein-protein interfaces is often driven by van der Waals interactions that arise from shape complementarity; however, better consideration of ordered solvent molecules may allow for the design of more natural interfaces that include numerous polar residues. Application of these new methods need not be limited to the solvation of interfaces or the description of binding partners. For example, the methods may be applied to more accurately predict the folded state of monomeric proteins in which buried solvent plays an important structural role, or to predict the stabilizing or destabilizing effect of mutated residues on the surface of a protein. Additionally, the experiments described herein only consider the solvation of proteins and small molecules; however, the framework can be easily extended to solvate other biomolecules such as nucleic acids.

Methods

Two new biomolecular solvation methods are introduced here. The first (Rosetta-ICO) builds upon the existing implicit water model used in Rosetta to not only account for the energy of desolvating protein functional groups, but to additionally energetically favor conformations that are suitable to accommodate bridging waters. The second model (Rosetta-ECO) places well-coordinated water molecules on the surface or at interfaces of biomolecules based largely on statistics from high-resolution experimental data.

Implicit solvation (Rosetta-ICO)

An additional energy term is added to the Rosetta’s implicit solvation model that models the energetic costs of highly ordered water molecules coordinated by multiple protein polar groups. The term builds upon our previously developed anisotropic solvation model[14], where for each polar group, one or more virtual water sites are placed in a configuration ideal for hydrogen bonding with the corresponding polar group. An energetic bonus is then given when the water sites of multiple polar groups overlap in such a way that a single water could coordinate, or “bridge”, these polar groups: With: Here, Elk(i,j) is the isotropic solvation term between atoms i and j (the fa_sol score term in the Rosetta energy function, see Fig 1A), is the xyz coordinate of a theoretic water oxygen atom corresponding to polar group ; is the xyz coordinate of the base heavy atom used to construct the water (e.g., the backbone N or O), and D0len, D0angle, S0len, and S0angle are parameters that are optimized during energy function evaluation, with final values of 0.5 Å, 4.33 Å, 1.61 Å, and 2.69 Å, respectively. Since a single polar atom may have multiple putative water binding sites, we take the minimum distance between all water sites corresponding to atoms i and j (the first term of the equation). Overall, the two terms in the equation characterize the overlap between potential water sites and the angle formed between polar groups that potentially coordinate a bridging water molecule. This energy term was added to the current anisotropic solvation model in Rosetta (illustrated in Fig 1A–1D), and optimization of all polar terms was carried out (see S1 Text). Since this term does not prevent certain disallowed coordination geometries (e.g., 3 donors or 3 acceptors coordinating a single water site), we have introduced the Rosetta-ECO model to include fully modeled water molecules at possible hydration sites that can help filter out conformations with poor coordination geometry. 
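The first quantity described above, the minimum distance between the virtual water sites of two polar atoms, is straightforward to compute. The sketch below shows only that step; the function name and example coordinates are illustrative assumptions, and the fading functions governed by D0len, S0len, D0angle, and S0angle are not reproduced here.

```python
import math
from itertools import product

def min_water_site_distance(sites_i, sites_j):
    """Smallest distance (Å) between any virtual water site of polar atom i
    and any virtual water site of polar atom j."""
    return min(math.dist(a, b) for a, b in product(sites_i, sites_j))

# Illustrative coordinates: atom i has two putative water sites, atom j has two.
sites_i = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
sites_j = [(3.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
d_min = min_water_site_distance(sites_i, sites_j)
```

A small minimum distance indicates that a single water could plausibly occupy a site shared by both polar groups, which is the situation the bridging term rewards.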
Additionally, because this two-body energy term is only dependent upon the configuration of pairs of protein polar groups, it can be used in all Monte Carlo minimization methods used in Rosetta[25], with negligible computational overhead. Additionally, to properly handle the geometry of water-protein and water-water hydrogen bonds, we modified the functional form of sp3-hybridized hydrogen bond acceptors. Previously, the interaction between a hydrogen bond donor and the lone pair electrons of sp3-hybridized acceptors was described by an angle and torsional term about the base atoms[26]; e.g., for serine, the angle CB-OG· · ·Hdonor and the pseudo-torsion HG-CB-OG· · ·Hdonor. For water, however, this led to an undesirable property in that the potential was not symmetric about the two water hydrogens. Therefore, in Rosetta-ICO (and Rosetta-ECO) we replace the torsional term for sp3 hydrogen bond acceptors with a “softmax” potential between both atoms bonded to the sp3-hybridized acceptor: Above, M describes the "softness" of the softmax with a default value of 0.4 kcal/mol (lower values make this function behave more like a “max”). The variables bk, ai and hj are the acceptor base atom, acceptor heavy-atom and donor hydrogen, respectively; and EBAH is the angular potential about the heavy-atom[26]. The summation is carried out over all bound atoms to the acceptor. For water acceptors, this would be over both hydrogens. In the serine example above, the angular potential is applied to both CB-OG· · ·Hdonor and HG-OG· · ·Hdonor, with the softmax giving a score roughly equal to the worse of the two angular potentials. This ensures the potential is symmetric about both water hydrogens.
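The behavior of the softmax can be illustrated numerically. The sketch below assumes the common log-sum-exp form of a smooth maximum, which matches the properties stated above (symmetric in its inputs, approaching a hard max as M decreases); the exact published expression may differ, and the function name and input energies are hypothetical.

```python
import math

def softmax_energy(angular_energies, M=0.4):
    """Smooth maximum: M * log(sum(exp(E / M))), computed stably.

    angular_energies: one E_BAH value per atom bonded to the acceptor.
    M: softness in kcal/mol; smaller values approach max(angular_energies).
    """
    e_max = max(angular_energies)
    total = sum(math.exp((e - e_max) / M) for e in angular_energies)
    return e_max + M * math.log(total)

# Illustrative values: one favorable and one unfavorable angular energy.
soft = softmax_energy([-1.0, 0.5])            # default softness
hard = softmax_energy([-1.0, 0.5], M=0.01)    # nearly a hard max
swapped = softmax_energy([0.5, -1.0])         # same inputs, other order
```

With these inputs the result sits slightly above the worse (higher) of the two energies, and swapping the inputs leaves it unchanged, which is the symmetry needed for the two water hydrogens.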

Explicit solvation model (Rosetta-ECO)

One key challenge in prior explicit water modeling[27] is the large conformational space a single water molecule can adopt. This is an issue in applications (like those in this manuscript) where it is desirable to simultaneously sample side chain conformations and water positions. Rosetta-ECO uses a two-stage approach to address this problem (Fig 1E–1H). In the first stage, rotationally independent “point waters” are sampled using a statistical potential; ignoring water rotation lets thousands of putative water positions be sampled efficiently. In the second stage, for the most favorable water positions (typically only several dozen), we consider rotations of these molecules using a physically derived potential. In both stages, Monte Carlo sampling is used to simultaneously sample side chain and water conformational states, and water molecules may be set to “bulk,” gaining an entropic bonus by doing so. This entropy bonus value, Ebulk, ultimately controls the number of explicit water molecules placed by the algorithm, since a water needs sufficient favorable physical interactions to overcome the entropic cost of coming out of bulk. This parameter was fit to a value of 1.22 kcal/mol. The atoms of any water molecule introduced into a model are subject to the same treatment by the full-atom Rosetta force field as any other atom, including interacting with bulk solvent via the lk_ball solvation model[14, 27]. Finally, rotational sampling of waters uses a uniform SO3 gridding strategy[28] with 30° angular spacing, leading to 270 rotational conformers per water.
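
How Ebulk limits the number of explicit waters can be seen in a small worked sketch. It assumes that the bulk state is scored as −Ebulk relative to an explicit water with a given interaction energy, and that the move is accepted with the standard Metropolis criterion. Both the bookkeeping and the function names are illustrative assumptions, not a description of Rosetta's implementation.

```python
import math

E_BULK = 1.22  # kcal/mol, fitted bulk bonus quoted in the text


def delta_to_bulk(interaction_energy, e_bulk=E_BULK):
    """Energy change (kcal/mol) for moving an explicit water to bulk.

    ASSUMPTION: the bulk state scores -e_bulk and the explicit state scores
    its interaction energy (negative means favorable).
    """
    return -e_bulk - interaction_energy


def metropolis_probability(delta_e, rt):
    """Standard Metropolis acceptance probability at temperature rt."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / rt)


# A weakly bound water (-0.5 kcal/mol) is always released to bulk, while a
# well-coordinated one (-3.0 kcal/mol) is rarely released at RT = 0.3.
weak = metropolis_probability(delta_to_bulk(-0.5), rt=0.3)
strong = metropolis_probability(delta_to_bulk(-3.0), rt=0.3)
```

In this picture a water survives as an explicit molecule only when its interactions are more favorable than −1.22 kcal/mol, which is the sense in which Ebulk controls how many waters are placed.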

Derivation of the statistical point water potential

The first step in determining possible water sites uses a low-resolution, statistical water potential to quickly evaluate the interaction between possible water sites and nearby polar groups of biomolecules. This potential, which we call the “point water potential”, treats water molecules as simple, uncharged points with attractive and repulsive Lennard-Jones terms. In the point water potential, P is the statistical point-water distribution, parameterized over distance and angle; d is the distance between a water and a polar atom, and θ is the angle between the water position, the polar atom, and its “base atom.” The point water energy term also considers other nearby point water sites, k, as Gaussian distributions with width σ and height K, with the energy minimum at a distance of 2.7 Å, which was determined by averaging water-water distances observed in high-resolution crystal structures. K and σ were optimized, yielding values of 0.52 kcal/mol and 0.24 Å, respectively. Finally, an overall energetic cost of bringing the water molecule “out of bulk,” Epwat_bulk, is added for each water, with a value of 2.71 kcal/mol. These parameters were fit using crystallographic waters in the Top8000 database (see Supporting Information for more details).
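
As a rough illustration of two ingredients described above, the sketch below builds a statistical potential from binned water counts and evaluates a Gaussian water-water term. The −ln(P_polar) + ln(P_reference) construction follows the histidine example in the supporting figure caption on the derivation of the statistical water potential. The pseudocount, the use of kT units for the statistical term, the sign convention (a well of depth K), and the reading of σ as a standard deviation are assumptions made here for illustration.

```python
import math


def _normalize(counts):
    """Turn a 2D table of counts into probabilities that sum to 1."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]


def statistical_potential(polar_counts, reference_counts, pseudocount=1.0):
    """Per-bin potential -ln(P_polar) + ln(P_reference), in units of kT.

    Rows index distance bins and columns index angle bins (hypothetical layout).
    ASSUMPTION: a pseudocount is added so that empty bins stay finite.
    """
    polar = _normalize([[c + pseudocount for c in row] for row in polar_counts])
    ref = _normalize([[c + pseudocount for c in row] for row in reference_counts])
    return [
        [-math.log(p) + math.log(r) for p, r in zip(p_row, r_row)]
        for p_row, r_row in zip(polar, ref)
    ]


def water_water_term(distance, height=0.52, width=0.24, optimum=2.7):
    """Gaussian water-water well (kcal/mol) centered at 2.7 Angstrom.

    ASSUMPTION: energy = -K * exp(-(d - 2.7)^2 / (2 * sigma^2)).
    """
    return -height * math.exp(-((distance - optimum) ** 2) / (2.0 * width ** 2))


# Toy counts: waters are enriched in the first bin near the polar group.
potential = statistical_potential([[90, 10]], [[50, 50]])
```

Bins where waters are enriched relative to the non-polar reference come out negative (favorable), and depleted bins come out positive.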

Identifying and sampling point water positions

A key challenge in building possible water sites is the desire to simultaneously sample side chain conformations along with water positions. Thus, the initial placement of water molecules to be optimized by the point water potential comes from two sources: a) ideal solvation about protein backbones and b) possible solvation sites from side chain rotamers. For backbone waters, point generation is straightforward: 1 “ideal” site is generated for each backbone N-H group and 10 “ideal” sites are generated for each backbone C=O group (based on clustering waters from crystal structures, S15 Fig & S16 Fig). Generation of side chain-coordinated waters is more involved. Considering all possible water molecules that may coordinate the polar groups of all side chain rotamers leads to water conformer sets that are unmanageably large to sample. Thus, we again build off prior work[29] and consider instead the overlapping hydration sites that emanate from two different side chain or backbone groups. That is, we collect the idealized hydration sites for all possible side chain rotamers and identify all positions where there is overlap (within 0.75 Å) between two potential water sites originating from different side chains or backbone groups. A 3D hash table makes this calculation efficient even when there are millions of putative water positions. Finally, to further reduce conformational sampling, during the Monte Carlo “packing” algorithm, when both side chain and point water positions are sampled, all putative point waters are clustered into sets in which only one site can be occupied. A modified version of Rosetta’s traditional packing algorithm[30] is used when point waters are present. Typically, Rosetta uses simulated annealing to find the discrete rotamer set minimizing system energy, where the temperature of the trajectory is slowly annealed from RT = 100 to RT = 0.3 kcal/mol.
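
The overlap search over idealized hydration sites described in this paragraph can be sketched with a uniform grid hash. This is a minimal illustration of the general technique, assuming each site is tagged with the identifier of the side chain or backbone group that generated it. The data layout and function name are hypothetical and are not taken from the Rosetta source.

```python
import math
from collections import defaultdict
from itertools import product


def find_overlapping_sites(sites, cutoff=0.75):
    """Return index pairs of sites from different groups within `cutoff`.

    sites: list of (group_id, (x, y, z)) tuples (hypothetical layout).
    Cells have edge length `cutoff`, so any pair closer than the cutoff
    must lie in the same or an adjacent cell.
    """
    grid = defaultdict(list)
    pairs = []
    for idx, (group, pos) in enumerate(sites):
        cx, cy, cz = (math.floor(c / cutoff) for c in pos)
        # Only the 27 neighboring cells need to be inspected.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for other in grid.get((cx + dx, cy + dy, cz + dz), ()):
                other_group, other_pos = sites[other]
                if other_group != group and math.dist(pos, other_pos) <= cutoff:
                    pairs.append((other, idx))
        grid[(cx, cy, cz)].append(idx)
    return pairs


# Toy example: two groups place sites 0.5 Angstrom apart; a third is far away.
example = [
    ("SER12", (0.0, 0.0, 0.0)),
    ("ASP40", (0.5, 0.0, 0.0)),
    ("SER12", (0.2, 0.0, 0.0)),
    ("LYS77", (5.0, 5.0, 5.0)),
]
overlaps = find_overlapping_sites(example)
```

Because each site is compared only against sites in neighboring cells, the cost grows roughly linearly with the number of sites rather than quadratically, which is what makes millions of putative positions tractable.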
With the point water potential, we do not expect the force field (which does not consider water rotation) to be perfect, and we want the packer not to optimize total energy but to simply separate reasonable from unreasonable water positions for a more expensive subsequent calculation. Thus, we instead used long simulations at low temperatures (RT = 0.3) at which the "dwell time" of each state is recorded, with intervening high-temperature “spikes” (RT = 100) used to periodically scramble the state which may settle into various low-energy minima of the potential energy surface. Then, instead of taking the lowest energy state sampled, we measured water “occupancy” at each position, taking point water positions with a “dwell time” greater than 2% (ignoring occupancy counts during the high-temperature steps and the first 1/6 of low-temperature steps of each iteration). Water positions passing this criterion, typically on the order of dozens to hundreds, are then filled with three-point water molecules which are allowed to rotate about fixed oxygen positions and are sampled (along with all surrounding side chains) using Rosetta’s standard simulated annealing rotamer optimization routine. The Monte Carlo algorithm is unaltered from the standard packing routine in Rosetta, in that a random rotamer (side chain or water) is selected and tested against a Metropolis criterion. The only exception here is that when a water rotamer, which is a rotational state about a fixed oxygen position, is selected for sampling, there is a 50% chance that the “virtual” state/rotamer of the water molecule is sampled instead. Given that the first stage in the solvation routine places a substantially larger than expected population of water sites on the surface of the biomolecule, a majority of these sites will not result in water conformations that are well-coordinated by the surrounding protein or ligand polar atoms. 
This adjustment to the sampling of water states helps with the convergence of the sampling problem with so many potential false positive water molecules. Finally, during subsequent minimization of the water-containing model, the kinematics used to minimize water molecules (the fold tree) in Rosetta optimizes 6 degrees of freedom for each water, representing the rigid-body transformation between the water and nearest amino-acid (using Rosetta terminology, the “jump” is defined between the nearest amino acid and the water).
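
The occupancy filter used between the two stages can be sketched as follows. The example assumes that “dwell time” means the fraction of counted low-temperature steps in which a site is occupied, that high-temperature steps have already been excluded from the input, and that the first 1/6 of each iteration is skipped by integer division. The function name and data layout are hypothetical.

```python
from collections import Counter


def select_sites_by_dwell(iterations, threshold=0.02, burn_in_divisor=6):
    """Return site ids whose occupancy exceeds `threshold`.

    iterations: list of iterations; each iteration is a list of
    low-temperature steps, and each step is the set of occupied site ids.
    ASSUMPTION: the first len(steps) // burn_in_divisor steps of every
    iteration are discarded as equilibration.
    """
    counts = Counter()
    counted_steps = 0
    for low_t_steps in iterations:
        skip = len(low_t_steps) // burn_in_divisor
        for occupied in low_t_steps[skip:]:
            counts.update(occupied)
            counted_steps += 1
    if counted_steps == 0:
        return set()
    return {site for site, n in counts.items() if n / counted_steps > threshold}


# Toy trajectory: w1 is always occupied, w2 only during burn-in,
# w3 in 1% of counted steps, and w4 in 3% of counted steps.
steps = [{"w1"} for _ in range(120)]
for i in range(20):
    steps[i] = {"w1", "w2"}
steps[50] = {"w1", "w3"}
for i in (60, 61, 62):
    steps[i] = {"w1", "w4"}
selected = select_sites_by_dwell([steps])
```

Only the sites that pass this filter would then be promoted to three-point waters for the more expensive rotational sampling.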

Datasets

Four different data sets were used in the testing of the new energy functions described here. The first includes 153 high-resolution crystal structures of protein-protein interfaces (PPIs) that were used for both native water and rotamer recovery at the interfaces. Two docking data sets were used to test the ability of the new energy functions to discriminate near-native from decoy docking conformations. These are subsets of those used by Park et al.[14], selected for water-rich interfaces (and to exclude problematic cases such as PPIs with disulfides across the interface or ions contributing to binding). For protein-protein interactions, a 53-case subset of the ZDock 4 Benchmark set[31] was used, while a 46-case subset of the Binding MOAD database[32] was used for protein-ligand interactions. Finally, another ligand docking set, generated with GOLD on a subset of the Astex Diverse Set[23], was used to compare the new energy functions against an established docking score function. All conformational sampling to generate the docking datasets was performed with fixed protein backbones. There is no overlap between the datasets used for parameter training and those used for the docking discrimination tests, and while there is significant overlap between the protein-ligand and Astex sets, these were used for different purposes and with significantly different sampling strategies. Additional details on the datasets, including lists of PDB IDs used, are included in the Supporting Materials.

Benchmarking against 3D-RISM water site predictions

The water site predictions in Rosetta were compared against those predicted by the 3D-RISM method[33] as implemented in AmberTools19[16, 34]. Briefly, RISM calculations were performed for pure water at a concentration of 55.5 M with a 0.5 Å grid spacing. Using a buffer of 7 Å, as opposed to the default 14 Å, was found to speed up calculations while not hurting recovery for our dataset, which consists of water molecules found at PPIs. The Placevent algorithm[35] was used to determine explicit water sites, which were restricted to those found within 6 Å of the CB atoms (CA for GLY) of the residues that form the interfaces of the test set. This was done to be comparable to the Rosetta-ECO results, in which water sampling was limited to protein-protein interfaces. Finally, the results were further trimmed by the 3D-RISM water-protein radial distribution function (RDF ≥ 10.2) to achieve the same level of precision as Rosetta-ECO.
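
The two trimming steps applied to the predicted sites can be written as a simple filter. This sketch assumes that a site is kept when it lies within 6 Å of at least one interface CB (or glycine CA) atom and has an RDF value of at least 10.2; the input layout and function name are hypothetical and do not correspond to any Placevent or AmberTools interface.

```python
import math


def filter_water_sites(sites, interface_atoms, max_dist=6.0, min_rdf=10.2):
    """Keep predicted water sites near the interface with a high enough RDF.

    sites: list of ((x, y, z), rdf_value) tuples (hypothetical layout).
    interface_atoms: list of (x, y, z) coordinates of CB atoms (CA for GLY).
    ASSUMPTION: "within 6 Angstrom" means within 6 Angstrom of any such atom.
    """
    kept = []
    for xyz, rdf in sites:
        near = any(math.dist(xyz, atom) <= max_dist for atom in interface_atoms)
        if near and rdf >= min_rdf:
            kept.append((xyz, rdf))
    return kept


# Toy example with one interface atom at the origin.
predicted = [
    ((1.0, 0.0, 0.0), 12.0),   # close and well above the RDF cutoff: kept
    ((1.0, 0.0, 0.0), 5.0),    # close but RDF too low: dropped
    ((10.0, 0.0, 0.0), 20.0),  # high RDF but too far away: dropped
]
kept_sites = filter_water_sites(predicted, [(0.0, 0.0, 0.0)])
```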

Binding energy calculations

The binding energies, ΔGbind, were calculated for the near-native and incorrect (decoy) docking poses by taking the difference between the computed energies of the bound and unbound states. This is accomplished in Rosetta by first calculating the energy for the bound system, then re-computing the energy when the two binding components are separated to obtain unbound state energies. An important part of interface energetics involves computing the energy cost of water displacement[36], making treatment of explicit waters of the unbound state an important consideration. Due to size differences of the average interface, we found that slightly different treatments performed best for PPIs and for protein-ligand interfaces. In both PPIs and protein-ligand interfaces, the bound states are solvated (including reoptimization of interface side chains) using the two-stage Monte Carlo procedure described above, restricting water placement to only the biomolecular interface of interest. Given that this mode of solvation samples both side chain and water orientations, our strategy considers the induced fit effect on a fixed backbone level. Then all side chains are minimized and, for protein-ligand interfaces only with the ICO model, the rigid-body transformation between receptor and ligand is also minimized. Interface components are then separated and re-solvated. The waters from the bound state are duplicated such that the ligand and the receptor each receive a copy, while the re-solvation protocol restricts new water placement to the same region that defined the interface in the bound state. During the resampling of the unbound state, side chains that previously defined the interface are once again reoptimized, allowing waters that were previously highly coordinated in the bound state to be liberated to bulk if a sufficient part of this coordination was lost in the unbinding process.
Any water molecules that remain un-liberated to bulk following sampling are considered part of the bound/unbound states for scoring purposes. RMSD values reported for docking are of the small molecule or protein ligand with respect to the native experimental structure. Ligand Cα RMSDs are used for protein-protein docking cases, where the ligand is the second chain in the experimental PDB file, while heavy atom RMSDs are used for small molecule docking cases. Sample XML scripts used for the protein/protein and protein/ligand rescoring are included in the Supporting Information.
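
The arithmetic behind the reported quantities can be sketched in a few lines. The example assumes that ΔGbind is the bound-state energy minus the sum of the separated-component energies, and that RMSD is computed over matched atoms without any additional superposition (that is, with both structures already in a common reference frame). The function names and the toy numbers are hypothetical.

```python
import math


def binding_energy(e_bound, e_unbound_components):
    """Difference between bound and unbound state energies.

    e_unbound_components: energies of the separated, re-solvated partners.
    """
    return e_bound - sum(e_unbound_components)


def rmsd(coords_model, coords_native):
    """Root-mean-square deviation over matched atoms, no superposition.

    ASSUMPTION: both coordinate lists are in the same reference frame and
    list the same atoms in the same order (C-alpha atoms for protein
    ligands, heavy atoms for small molecules).
    """
    if len(coords_model) != len(coords_native) or not coords_model:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_model, coords_native))
    return math.sqrt(squared / len(coords_model))


# Toy numbers: a complex scoring -100 with partners scoring -60 and -30.
dg_bind = binding_energy(-100.0, [-60.0, -30.0])
ligand_rmsd = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                   [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])
```

Plotting such RMSD values against the corresponding binding energies for near-native and decoy poses gives the kind of score-versus-RMSD distribution described in the supporting figure captions.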

Training tasks

The training tasks used for energy function parameterization are the same as detailed in the development of the REF2015 Rosetta energy function[14] and are summarized in the Supporting Information.

Data Set 1.

PDB files used for water recovery tests. Protein-Protein (16 GB) and Protein-Ligand (1.6 GB) decoy sets are available at https://github.com/rpavlovicz/rpavlovicz-docking_data_sets. (DOCX)

Rescoring GOLD docking results with Rosetta.

Results for rescoring Astex Diverse Set. Docking conformations initially generated and scored by GOLD (red) were rescored with the Rosetta REF2015 energy function (blue). The theoretical scoring success is determined by the initial GOLD sampling (black dashed) for the 67 cases of the Astex Diverse Set that do not coordinate an ion in the binding site. (PNG)

Protein-protein docking scoring results (part 1).

Recalculation of protein-protein docking interface scores (ΔGbind) for three different Rosetta scoring functions: REF2015, Rosetta-ICO, and Rosetta-ECO. Data points represent the average of three runs with the standard deviation as error bars. The average Boltzmann discrimination score ± standard deviation for each distribution is shown in the bottom right corner of each plot. (TIF)

Protein-protein docking scoring results (part 2).

Protein-protein docking scoring results (part 3).

Protein-protein docking scoring results (part 4).

Protein-protein docking scoring results (part 5).

Protein-ligand docking scoring results (part 1).

Recalculation of protein-ligand docking interface scores (ΔGbind) for three different Rosetta scoring functions: REF2015, Rosetta-ICO, and Rosetta-ECO. Data points represent the average of three runs with the standard deviation as error bars. The average Boltzmann discrimination score ± standard deviation for each distribution is shown in the bottom right corner of each plot. (TIF)

Protein-ligand docking scoring results (part 2).

Protein-ligand docking scoring results (part 3).

Protein-ligand docking scoring results (part 4).

1N2J binding mode with Rosetta-ECO model.

The near-native Rosetta-ECO model is in thicker stick representation with full-atom water molecules and the ligand depicted in pink. The experimental ligand (pantoate) position is in transparent blue, water oxygen positions are shown as green spheres, and native side chains are in black wire representation. If the native ligand or side chain positions cannot be seen, it is because they are obscured by the Rosetta model. Panel A highlights the overall binding pocket, while panels B-D focus on recovered water positions. (TIF)

1U4D binding mode with Rosetta-ECO model.

The near-native Rosetta-ECO model is in thicker stick representation with full-atom water molecules and the ligand depicted in pink. The experimental ligand (debromohymenialdisine) position is in transparent blue, water oxygen positions are shown as green spheres, and native side chains are in black wire representation. If the native ligand or side chain positions cannot be seen, it is because they are obscured by the Rosetta model. Panel A highlights the overall binding pocket, while panels B and C focus on recovered water positions. (TIF)

Comparison of docking scores/energies for conformations sampled by GOLD for select cases.

The RMSD of the ligand from the experimental conformation is plotted against the computed score (ChemPLP) for GOLD and ΔGbind for Rosetta. Note that the sampling from GOLD is often focused on a small number of docking conformations, leaving gaps in the sampled space. (TIF)

Derivation of a statistical water potential.

Upper left: distribution of waters about histidine residues over a range of distances from the HD1 atom and a range of angles from the HD1 and ND1 atoms [-log(HISHD1_ND1)]. Upper right: distribution of waters about a non-polar reference [log(ALAHB1_CB1)]. Lower left: the sum of the upper two panels, giving the statistical potential for histidine. Lower right: final, modified histidine potential filtered for noise and second solvation shell effects. (PNG)

Sample statistics of waters about peptide C=O groups.

Upper right: distance and angle of all waters measured (grey) and those used for statistical placement about the polar group (purple). Bottom left: angle and dihedral distribution with histogram projections in upper left (angle) and lower right (dihedral). (PNG)

Position of cluster representatives for solvation of C=O backbone groups.

The crystallographic water positions used for statistical placement of potential solvation sites about C=O backbone polar groups are shown here in red, with the k-means cluster centroids (k = 10) illustrated in yellow. Two views of these data are shown about an arbitrary alanine residue. (PNG)

Rotamer recovery error as a function of native water positions randomly perturbed.

Crystallographic water molecules in our benchmark set were randomly perturbed 0.0 to 1.6 Å and the interface residues were repacked in Rosetta. Data points represent the average of three independent runs with 95% confidence interval error bars. The baseline of packing the interfaces without any water molecules (REF2015 score function) is shown as a dashed grey line with 95% confidence intervals from three runs shaded in light grey. (PNG)

GOLD docking and Rosetta rescoring results of Astex Diverse Set.

(DOCX)

3D-RISM results on interface water test set.

(DOCX)

Timing comparison between Rosetta-ECO and 3D-RISM.

(DOCX)

Rosetta force field parameters.

Final parameters used in Rosetta-ICO and ECO force fields. (DOCX)

Dataset information.

Details on datasets used for water recovery and docking discrimination tests, including GOLD docking protocol. (DOCX)

Additional information.

Includes information about CAPRI Target 47, details on the derivation of low-resolution statistical water potential, and more information on the force field parameter training tasks. (DOCX)

XML script.

RosettaScripts XML file used for protein-protein / protein-ligand interface scoring with explicit water molecules (Rosetta-ECO). (DOCX)

30 Jan 2020

Dear Dr. DiMaio,

Thank you very much for submitting your manuscript "Efficient consideration of coordinated water molecules improves computational protein-protein and protein-ligand docking discrimination" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.
Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Björn Wallner
Associate Editor
PLOS Computational Biology

Mona Singh
Methods Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: Most protein structure prediction, docking and design methods use implicit solvent for efficiency, but it is well known that specific waters can appear in crystallographic interfaces and folds. One way to keep the efficiency of implicit models is to include only a few explicit waters in a hybrid model, but this approach proved too slow in prior attempts. Here, this paper presents two new methods to model coordinated waters in biomolecular structures. The first, ICO, is still implicit but examines overlap of virtual water placement sites. The second, ECO, is semi-explicit and places three-atom waters using hash tables and Monte Carlo approaches to keep it fast. The results show that it is still quite difficult to place crystallographic waters (under 18% recovered with under 18% precision), but nevertheless this reduces side-chain placement error modestly. More importantly, it clearly improves discrimination of protein-protein and protein-ligand docking poses upon refinement, showing that, as has been conjectured, the lack of explicit waters at interfaces is one of the limiting factors in docking. The paper marks a significant advance in methods and should be published. I have a few small points on the main paper, and several clarifications needed in the methods. My long list of methods questions is to help understand this calculation and is not an indication of shortcomings of the main work.

Main text / major comments:

1. The CAPRI challenge included a water-placement target a few years ago. It would be interesting to see how these methods could perform on that target, and it would provide a clear metric of these methods against a large set of methods in the field. See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4582081/ for details; at CAPRI meetings, Lensink has even offered to provide the evaluation scripts as needed.
2. Please add a couple more sentences to the main text describing the main hypotheses and concepts behind the two models.
3. In Figure 1A-D, the coordinate system is not clear. What are we looking at here?
4. Binding energy: are the separated components repacked or relaxed? I think not from the methods, please specify (line 470). (Chaudhury found that not relaxing shows a clearer signal for docking; he called it “interface energy” to distinguish it from the relaxed-monomer case which is more like nature.) Also are waters in the separated components placed all over the monomers or just at the binding interface? It was nice that the method buries about the same number of waters at interfaces as Janin counted in crystals; can you also compare the waters at the separated interfaces and are they also like Janin or another reference of structured waters?
5. It seems like the docking is done with a rigid protein backbone. Please confirm and make sure that is clear in the paper. Or, if there is backbone motion, describe and explain its impact on the results.

Methods questions:

6. Table 2: which rmsd measure is used? Lrmsd? Irmsd? Atom selection? Please define in methods.
7. Line 176: In the benchmark set, the 53 protein-protein interfaces are categorized as rigid, moderate or flexible backbone targets. Which were chosen? Are Rosetta-ICO and Rosetta-ECO more or less helpful on any of these categories?
8. Line 177: As per line 466 and comment 4 above, use “interface energy”
9. Line 339: Add units on D0 and S0. I suggest replacing S0 with S0^2 in the formula so that both parameters can have dimensions of length. This would help make it physically interpretable.
10. Line 335: Why was the form of G chosen? It seems like it might be very sharp at the minimum. Add a plot with the final values of D0 and S0 and units.
11. What is the reference energy for the lk-bridge term? As atoms move away, the energy goes to infinity! How is this handled?
12. Line 340-341: I get how the second term targets for distance, but the first term seems to look to superpose the two water molecules. I guess ‘angle’ comes out of that, but I suggest rephrasing this sentence.
13. Line 343: Please add a table of all the new parameters, to compare with the one in Alford et al and show the differences from the old set. Also, in the text, please give the nominal values for the new fitted parameters S0 and D0.
14. Line 346: I’m not sure what ‘quite reasonable’ means. Change to state the fraction of time that a donor or acceptor violates its number of physically reasonable bonds.
15. Line 348: citation doesn’t seem right.
16. Line 355: it took me a long time to figure out what was meant by treating the ‘water asymmetrically’. Maybe add another example (in addition to the serine example) with the water atoms as an example. If I’m following correctly, the water will not even have a pseudo-torsion energy, correct?
17. Line 355: what is M? Please define with final value and units. Also I think the sign might be wrong here.
18. E_ref is a poor name for the energy of releasing a water to the bulk because this symbol is already used in Rosetta many times. How about E_bulk or E_water_release or E_WS for water/entropy…anything that is more directly tied to its meaning here.
19. I wasn’t sure the “lk” suffix is useful on the bridge energy term name. Sure, it’s based on the Lazaridis-Karplus base model, but this is a whole new water approach. Is this just so it sorts in an alphabetical list near the other solvation terms? Maybe beyond the scope here, but maybe all solvation terms should be renamed E_solv_*, where * is lk or otherwise.
20. Is the lk_ball model used in ICO or ECO?
21. Line 391, what is the capital W about? Should that be a script W to denote the set of waters placed?
22. Line 391, replace d() with a norm function as in the earlier equation.
23. Line 394 add a small picture; I presume the angle is between w, a and b?
24. Line 397 give the final numbers for sigma and K with units.
25. Line 410, why not nitrogen backbone H-bonding sites too? Are side chain sites enumerated only from oxygens, not nitrogens?
26. Paragraph starting line 411, I don’t quite get the meaning and how this calculation is done. If rot_i1 clashes with rot_j1 but does not clash with rot_j2, what happens? When you detect overlaps in line 414 it sounds like it’s about sc-sc overlaps, but line 417 seems to talk about w-sc overlaps too? Please clarify as this hash table approach seems to be a key to speeding the calculation relative to the older solvated rotamers approaches. What’s in the hash table? Water positions?
27. For calculating ‘dwell time,’ is that only at the low temperature? After an equilibration period each time?
28. For the ECO monte carlo, please specify the MC move set and the fractional probability of calling each move type and the sizes of the moves (e.g. translations and rotations). The fold tree seems important too…do the waters move with one or both docking partners? Maybe all molecules (protein, ligand, and water) tie to a virtual origin point to move independently?
29. Line 466: Binding energies imply a transition from the unbound form of receptors and ligands to the bound form of the complex. The energy calculations described involve the energy of bound conformations of receptors and ligands as they form the binding complex, and would be better defined as the “interface” (or interaction) energy

Grammatical edits

1. In all the equations, some terms are xyz vectors and some are scalars. Consider using boldface for the xyz vectors to make them easier to interpret.
2. Line 142, add “two” before “predicted water sites”
3. Line 128: “implemented”
4. Line 224: change “scores” to “discrimination scores”? Also in Supp. Figure captions.
5. Line 319: not only account for desolvation penalties, but “also” energetically reward ….
6. Line 355 I think “replaced” should be “was replaced”?
7. Line 405: “conformations”
8. Line 414 “challenge”
9. Line 424 add “kcal/mol” units to RT
10. Line 459 capitalize “Placevent”
11. Line 444 should probably cite the separate Weng paper that just presents the docking benchmark

Reviewer #2: The manuscript by Pavlovicz et al. focuses on prediction of water molecules in protein-protein and protein-ligand complexes by implementing new methods in the Rosetta program. An impressive and original feature of the approach is that water positions and side chain rotamers are sampled at the same time, whereas most previous approaches have used a rigid protein. The authors also evaluate the performance of predictions of protein-protein and protein-ligand complexes. The research topic is relevant and important – waters are important for molecular recognition and accurate positioning of water should improve the performance of scoring functions. The results are mixed. In some cases, an improvement is observed with the waters included, in other cases it has no effect. I think that further analysis of the results is needed. I recommend that the following changes are made to the manuscript:

(1) Page 3: The authors write that their approach yields “superior results” at the end of the introduction. Compared to what?
(2) Page 4: I was surprised to read that only 17.7% of the native water molecules were predicted by the method for protein-protein interfaces. With previous methods (e.g. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0032036, https://pubs.acs.org/doi/10.1021/ci200150p) that are likely as fast or faster than the presented one, 76-97% of waters in crystal structures are predicted. These studies focus on proteins (not protein-protein complexes) and likely use rigid side chains, but can that explain the substantially lower performance of the method presented in this work? Or is there some other difference in the assessments (I think there might be some)? I recommend that the authors validate their method in the same way as these previous studies (e.g. with static side chains and same cutoffs) and comment on potential differences in the results. This analysis should not only include protein-protein complexes, but also protein-ligand complexes. This will give the reader a better idea of the performance.
(3) Page 7: The authors compare their results to 3D-RISM, which gives better results, but is much slower (20-fold). However, there are many very fast methods too, which have not been mentioned (see previous comment). The relative performance of the methods and their advantages/disadvantages should also be mentioned in the discussion.
(4) Page 9: The improvement of docking performance for Rosetta-ECO is very encouraging. However, the low prediction accuracy for waters (page 4) makes me question if the results are “right for the right reason”. The authors should quantify if the top-ranked complexes for which an improvement is observed also have accurately placed waters compared to those observed in the crystal structure.
(5) Related to the previous question, in Figure 3C-D: Can you compare to the crystal structure water positions and side chain conformations in this case? How many of the water predictions are confirmed?
Inspection of 1X8X revealed a phosphate close to the ligand.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: some parameter values were requested to be added to the methods (software is available).

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jeffrey J. Gray and a student

Reviewer #2: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.
For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future.

3 Apr 2020

Submitted filename: reviewer_comments.pdf

8 Jun 2020

Dear Dr. DiMaio,

Thank you very much for submitting your manuscript "Efficient consideration of coordinated water molecules improves computational protein-protein and protein-ligand docking discrimination" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments.
If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Björn Wallner
Associate Editor
PLOS Computational Biology

Jason Papin
Editor-in-Chief
PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Thank you for the CAPRI target analysis. This is quite nice. I think you should include it in the paper, maybe even summarize the result in the abstract, because it places your work within the context of the field. At a minimum, put the text in your reviewer response in the Supplement. All other concerns are well addressed in the text. Thank you for being so careful with the theory, methodological details, and presentation. The method is very exciting and is sure to be used by many, teaching us new things about the role of specific waters in structures.

Reviewer #2: The authors have addressed most of my questions, and only a minor addition is required. I am puzzled about why the authors think that comparison to simplified methods (e.g. WaterDock, WaterMap, AcquaAlta) is not so relevant. I think the authors should extend the clarifying section about this in the discussion and explicitly mention the results (e.g. % reproduced water) achieved by such methods.
The difference is interesting for the reader and may influence the choice of method to apply in applications. The authors could provide some advice on this also based on the results.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: Yes: Jeffrey J. Gray & Ameya Harmalkar

Reviewer #2: No
17 Jun 2020

Submitted filename: reviewer_comments_2.pdf

29 Jun 2020

Dear Dr. DiMaio,

We are pleased to inform you that your manuscript 'Efficient consideration of coordinated water molecules improves computational protein-protein and protein-ligand docking discrimination' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.
Best regards,

Björn Wallner
Associate Editor
PLOS Computational Biology

Mona Singh
Methods Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: Thank you for including the CAPRI analysis in the paper, it will be a nice reference for others.

Reviewer #2: My points have been addressed and I recommend acceptance.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: Yes: Jeffrey J. Gray

Reviewer #2: No

28 Aug 2020

PCOMPBIOL-D-19-01956R2

Efficient consideration of coordinated water molecules improves computational protein-protein and protein-ligand docking discrimination

Dear Dr DiMaio,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production.
Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
