Literature DB >> 30648154

Accurate Estimation of Ligand Binding Affinity Changes upon Protein Mutation.

Matteo Aldeghi¹, Vytautas Gapsys¹, Bert L de Groot¹.

Abstract

The design of proteins with novel ligand-binding functions holds great potential for application in biomedicine and biotechnology. However, our ability to engineer ligand-binding proteins is still limited, and current approaches rely primarily on experimentation. Computation could reduce the cost of the development process and would allow rigorous testing of our understanding of the principles governing molecular recognition. While computational methods have proven successful in the early stages of the discovery process, optimization approaches that can quantitatively predict ligand affinity changes upon protein mutation are still lacking. Here, we assess the ability of free energy calculations based on first-principles statistical mechanics, as well as the latest Rosetta protocols, to quantitatively predict such affinity changes on a challenging set of 134 mutations. After evaluating different protocols with computational efficiency in mind, we investigate the performance of different force fields. We show that both the free energy calculations and Rosetta are able to quantitatively predict changes in ligand binding affinity upon protein mutations, yet the best predictions are the result of combining the estimates of both methods. These closely match the experimentally determined ΔΔG values, with a root-mean-square error of 1.2 kcal/mol for the full benchmark set and of 0.8 kcal/mol for a subset of protein systems providing the most reproducible results. The currently achievable accuracy offers the prospect of being able to employ computation for the optimization of ligand-binding proteins as well as the prediction of drug resistance.

Entities: CellLine Chemical Disease Gene Species

Year: 2018 PMID： 30648154 PMCID： PMC6311686 DOI： 10.1021/acscentsci.8b00717

Source DB: PubMed Journal: ACS Cent Sci ISSN： 2374-7943 Impact factor: 14.553

Introduction

Ligand-binding proteins play essential roles in living organisms, with interactions between small organic molecules and proteins triggering a multitude of signal transduction processes.[1−3] Given their high affinity and selectivity, nontoxicity, and biodegradability, the design of proteins with novel ligand-binding functions also holds great potential for application in biomedicine and biotechnology.[4−10] Fast computational approaches that rely on mixed physics- and knowledge-based potentials, such as Rosetta,[11,12] have already proven successful in the early stages of the discovery process. For instance, Tinberg et al.[13] engineered protein binders to the steroid digoxigenin by first designing a minimal ligand-binding shell, then searching the protein data bank[14] (PDB) for suitable scaffolds, and finally optimizing the designs experimentally. Among the 17 computationally designed proteins, two were binding to digoxigenin in the micromolar range. After experimental optimization of the most promising design, a protein with sub-nanomolar affinity for the steroid was found. A similar approach has also been employed for the design of artificial enzymes by using a model of the transition state as the target ligand.[15−18] Other design approaches that have been proposed involve ligand docking to known protein structures[19] or de novo design of short protein sequences using a combination of docking, molecular dynamics (MD), and Monte Carlo simulations.[20] However, our ability to engineer ligand-binding proteins (e.g., biosensors and enzymes) is still limited, and current approaches rely heavily on experimentation, in particular at the optimization stage.[21−24] The limitations of Rosetta have often been ascribed to a limited treatment of backbone flexibility and lack of explicit solvation, so that computational approaches that tackle these challenges may provide a more accurate estimation of ligand affinity changes upon protein mutation.[4,22,25] Free energy calculations based on first-principles statistical mechanics that make use of nonphysical (i.e., alchemical) pathways in a thermodynamic cycle have become increasingly popular in small molecule drug discovery for the optimization of lead compounds[26,27] and have now started being used prospectively by the pharmaceutical industry.[28] Recently, alchemical free energy calculations have also shown promise for the prediction of protein thermostability and drug resistance.[24,29−33] These calculations naturally take into account the full flexibility of the protein–ligand complex and the discrete nature of the solvent, and return the exact binding affinity changes given the energy function (i.e., force field) used. These calculations could be incorporated into protein design pipelines; however, their quantitative performance for the prediction of ligand-binding affinity changes upon protein mutation is largely unclear at this stage. Here, we assess the ability of alchemical free energy calculations to quantitatively predict ligand binding affinity changes upon protein mutation on a challenging set of 134 mutations across 17 proteins and 27 ligands. We adopt an approach that calculates the nonequilibrium work associated with the alchemical transformation of protein side-chains (Figure a), using pmx[61] to build the hybrid structures and topologies. The computational efficiency of different setup protocols is first evaluated, and then the performance of multiple force fields is tested. We find that investing approximately equal amounts of simulation time in the equilibrium and nonequilibrium parts of the calculations results in the most efficient protocols, and that given a fixed amount of computational resources it is beneficial to average results from multiple force fields in a consensus approach. When compared to the experimental data, the free energy calculations achieve root-mean-square errors (RMSEs) as low as ∼1.3 kcal/mol, and ∼1.0 kcal/mol when excluding a few systems posing convergence challenges. Finally, we compare the MD results to those obtained with multiple Rosetta protocols, representing state-of-the-art approaches for protein design.[4,13,15−18,21] We find that a Rosetta protocol (flex_ddg)[34] recently proposed for studying protein–protein binding performs well also for protein–ligand binding. Simple averaging of Rosetta and MD results produced an improved performance for most combinations of force fields and Rosetta scoring functions, with the best combinations returning RMSEs of ∼1.2 kcal/mol for the full data set and ∼0.8 kcal/mol for a subset of well-converged results.

Figure 1

Overview of the benchmark data set studied. (a) Thermodynamic cycle showing the quantity to be predicted (ΔΔGbind); the free energy differences estimated via alchemical free energy calculations are highlighted in red. (b) Statistics of the data set about the protein–ligand systems and type of mutations considered. (c) Cartoon representation of the 17 protein systems present in the data set, with the number of affinity changes upon mutation reported. Ligands and cofactors are represented by spheres.

Results

To rigorously test the performance of the calculations, we first built a benchmark set consisting of 134 mutations across 17 proteins and 27 ligands (Figure b–c; ligands shown in Figure S1) from the Platinum[35] database (see Methods in the SI). Each mutation has an associated experimental binding free energy difference (ΔΔG) determined by isothermal titration calorimetry (ITC; 119 data points) or surface plasmon resonance (SPR; 15 data points), which we aim to compute starting from the X-ray structure of the wild type protein. The ΔΔG values have a range of 9.5 kcal/mol and are normally distributed (Figure S2) with a mean of 0.2 kcal/mol and standard deviation of 1.5 kcal/mol. Overall, this is a diverse and challenging benchmark set that contains large and flexible ligands, proteins with different folds, some also containing cofactors, and many charge-changing and small-to-large/large-to-small mutations. The results should thus provide a realistic average performance of the calculations across different protein–ligand systems. First, we evaluate how the setup of the calculations and their overall computational cost affect the precision and accuracy of the predictions so to be able to choose an efficient protocol for further calculations. Second, we test and compare multiple force fields. And finally, we compare the performance of the free energy calculations to that of different Rosetta protocols.

Calibration of the Free Energy Protocol

In nonequilibrium free energy calculations, there are three main setup variables affecting the total amount of simulation time and, thus, computational cost: In this first part of the study, we wanted to test how these three variables affect the precision and accuracy of the free energy protocol (the Amber99sb*-ILDN/GAFF2 force field was used at this stage). Thus, we tested protocols with five repeated equilibrium simulations between 1 and 10 ns in length (always after 1 ns of equilibration), a total number of nonequilibrium trajectories between 10 and 500, and five different transition lengths of 20, 40, 50, 80, and 100 ps (Figure a). The work values associated with each nonequilibrium transition were extracted using thermodynamic integration (TI)[36] and then used to estimate the free energy differences with the Bennett’s acceptance ratio (BAR)[37,38] relying on the Crooks Fluctuation Theorem.[39−41] The computationally cheapest protocol required 20.8 ns of simulation time, and the most expensive 400 ns. Accuracy was quantified as the RMSE between calculated and experimental free energies for the whole set of 134 mutations. Precision was quantified as the root-mean-square (RMS) of the standard errors (σ) of the predicted ΔΔGs, with σ being estimated from the five independent simulation repeats. In principle, the precision also determines how reproducible the results are. However, this relies on accurate estimates of the uncertainty for each predicted ΔΔG value. Thus, as a stricter test of the reproducibility of our chosen protocol, we defined and quantified “reproducibility” as the root-mean-square deviation (RMSD) between the ΔΔG values obtained from two independent sets of calculations, where each set comprised five equilibrium runs and 150 nonequilibrium trajectories in both directions considering the whole set of 134 mutations.

Figure 2

Calibration of the nonequilibrium free energy protocol. (a) Space of protocol setup parameters tested. The three axes indicate the length of the equilibrium simulations (five repeats of 1–10 ns), the number of nonequilibrium trajectories spawned from the equilibrium simulations (from 10 to 500), and their length (from 20 to 100 ps). Each mark represents a specific combination of the above three variables, with the color indicating the overall precision of the calculations (RMSσ). Equivalent plot color-coded by accuracy (RMSE) in Figure S3. (b) Scatter plots showing the overall precision and accuracy of different setup protocols that used nonequilibrium trajectories of 80 ps. Isolines indicate the computational cost (in simulation time) for one ΔΔG estimate. A green arrow indicates the protocol that was chosen for further calculations. (c) Reproducibility of the calculations. The scatter plot shows the agreement between two sets of ΔΔG estimates. For the second estimate, both the equilibrium and nonequilibrium parts of the calculations were repeated. On the bottom-right corner of the plot, the RMSD between the repeated calculations is shown. (d) Reproducibility of the calculations when increasing the number of independent equilibrium simulations to 10. Also in this case, both the equilibrium and nonequilibrium parts of the calculations were repeated. (e) Reproducibility of the nonequilibrium part of the calculations. In this case, two sets of nonequilibrium transitions were started from the same equilibrium simulations. (f) Reproducibility of the calculations (both equilibrium and nonequilibrium) for a subset of the data with four challenging protein systems excluded.

the length of the equilibrium simulations of the end-states (the bound and unbound forms of the wild type and mutant protein; Figure a); the total number of nonequilibrium trajectories that are spawned from the equilibrium simulations; and the length of these nonequilibrium trajectories, which is proportional to the speed of the alchemical transformation. Calibration of the nonequilibrium free energy protocol. (a) Space of protocol setup parameters tested. The three axes indicate the length of the equilibrium simulations (five repeats of 1–10 ns), the number of nonequilibrium trajectories spawned from the equilibrium simulations (from 10 to 500), and their length (from 20 to 100 ps). Each mark represents a specific combination of the above three variables, with the color indicating the overall precision of the calculations (RMSσ). Equivalent plot color-coded by accuracy (RMSE) in Figure S3. (b) Scatter plots showing the overall precision and accuracy of different setup protocols that used nonequilibrium trajectories of 80 ps. Isolines indicate the computational cost (in simulation time) for one ΔΔG estimate. A green arrow indicates the protocol that was chosen for further calculations. (c) Reproducibility of the calculations. The scatter plot shows the agreement between two sets of ΔΔG estimates. For the second estimate, both the equilibrium and nonequilibrium parts of the calculations were repeated. On the bottom-right corner of the plot, the RMSD between the repeated calculations is shown. (d) Reproducibility of the calculations when increasing the number of independent equilibrium simulations to 10. Also in this case, both the equilibrium and nonequilibrium parts of the calculations were repeated. (e) Reproducibility of the nonequilibrium part of the calculations. In this case, two sets of nonequilibrium transitions were started from the same equilibrium simulations. (f) Reproducibility of the calculations (both equilibrium and nonequilibrium) for a subset of the data with four challenging protein systems excluded. The more expensive protocols tended to return more precise (Figure a) and accurate (Figure S3) results. As there seemed to be no more benefits in terms of accuracy when increasing transition lengths from 80 to 100 ps (Figure S4), we focused on the 80 ps data. Figure b shows the behavior of the precision and accuracy of the calculations when using protocols with different equilibrium simulation lengths and number of nonequilibrium trajectories. We observed that, given a fixed amount of simulation time, exchanging equilibrium for nonequilibrium sampling (i.e., moving across the isolines of computational cost) resulted in both improved precision and accuracy. The most effective protocols involved investing about the same amount of simulation time in the equilibrium and nonequilibrium part of the calculations. In addition, the precision and accuracy of the predictions improved quickly from 10 to 100 nonequilibrium trajectories, after which further improvements came at a higher cost. Surprisingly, on average, we did not see a strong association between the length of the equilibrium trajectories and the accuracy or precision of the calculations. On the basis of this analysis, one of the cheaper protocols investing about half of the simulation time in nonequilibrium sampling was chosen for further testing, and specifically the protocol using 150 nonequilibrium trajectories of 80 ps in length, spawned from five equilibrium simulations of 3 ns (equivalent to 108 ns of total simulation time per ΔΔG calculation; Figure b). As a strict test of reproducibility, we repeated all calculations (including building the simulation systems) with this protocol and measured the RMSD between the ΔΔG values obtained from two independent sets of calculations (Figure c–f). Random ion placement during system setup was found to negatively affect reproducibility (Figure S5); therefore, we first updated our protocol to resolve this issue: ions were not allowed to be placed in the vicinity of the protein, and each equilibrium simulation was started from a different ion configuration. With this precaution, the RMSD between two repeated calculations was 1.40 kcal/mol (Figure c), which was above the target RMSD of 1 kcal/mol we wanted to reach. To improve the precision and reproducibility of the calculations, we resorted to a fourth setup variable: the number of equilibrium simulation repeats (initially set to five). This variable was not originally screened systematically because, assuming that each equilibrium simulation is independent, the precision of each ΔΔG estimate (and the RMSD between two sets of repeated calculations) should drop with the square root of the number of simulation repeats, which is also faster than when adjusting the other three setup variables studied (Figure b). By using 10 repeated equilibrium simulations and 300 nonequilibrium trajectories (i.e., keeping the same ratio of equilibrium/nonequilibrium sampling, and thus also doubling the cost per ΔΔG calculation to 216 ns), a reproducibility RMSD of ∼1 kcal/mol was achieved (Figure d), as expected with independent simulation repeats. To further investigate the source of uncertainty, we repeated only the nonequilibrium part of the calculations from the same set of equilibrium simulations. A comparison of the results obtained with these two runs (Figure e) revealed that about half (∼0.5 kcal/mol) of the reproducibility RMSD could be attributed to uncertainty in the nonequilibrium sampling (Text S1). Therefore, also based on this observation, it appears that on average it is reasonable to invest approximately equal computation effort in the equilibrium and nonequilibrium parts of the calculations. While we have been discussing protocol performance in average terms (across all protein–ligand systems), different protein systems present different sampling challenges. Indeed, different degrees of reproducibility were observed across systems, with four protein systems being found to be particularly challenging (Text S2 and Table S1). Excluding these systems, the RMSD between two repeated sets of calculations decreased further to ∼0.7 kcal/mol (Figure f). From here on, we will show the results for the full set of 17 protein systems (134 mutations) as well as for the high-reproducibility subset that excludes the four protein systems mentioned above (86 mutations). Also note that from here on we will discuss the accuracy for only one of the two sets of calculations performed with the Amber99sb*-ILDN/GAFF2 force field, as the two sets showed comparable accuracy with respect to experiment (RMSE of 1.070.981.40 kcal/mol versus 1.081.011.37 kcal/mol for the high-reproducibility subset, and of 1.391.291.72 kcal/mol versus 1.501.381.83 kcal/mol for the full data set).

Accuracy of the Calculations and Force Field Comparison

After calibrating the free energy protocol and assessing the reproducibility of the calculations, we evaluated the accuracy of multiple contemporary force fields from the Amber and Charmm families (Figure and Table ). The agreement between calculations and experiments was quantified using three performance measures: the RMSE, the Pearson correlation coefficient, and the area under the receiver operating characteristics curve (AUC-ROC). These measures capture different relationships between the calculated and experimental data: the RMSE measures the deviation between calculated and experimental values such that 68% of calculated ΔΔGs are within one RMSE of the experimental ones; the Pearson correlation coefficient quantifies the linearity of the relationship between calculated and experimental ΔΔGs; the AUC-ROC gauges instead the performance of the approach as a binary classifier. We show the 95% confidence intervals (CIs) of these measures in the relevant figures and tables. The CI was obtained via bootstrap, by random resampling with replacement the data set while at the same time stochastically resampling each ΔΔG estimate assuming a Gaussian distribution (see Methods in the SI). In such a way, the CI incorporates the imprecision of each estimate as well as the uncertainty due to the specific choice/availability of data set. Significant differences were estimated in a similar fashion, such that small p-values are indicative of performance differences that are unlikely to have been the result of the specific choice of data set or the imprecision of the estimates.

Figure 3

Performance of the free energy calculations with different force fields and force field combinations. (a) Scatter plots of experimental versus calculated ΔΔG values. The identity line is shown as a dashed gray line, while the shaded area indicates the region where ΔΔG estimates are within 1.4 kcal/mol of experiment (i.e., within a 10-fold error in Kd change at 300 K). The performance for the high-reproducibility subset of the data is reported at the top-left of each plot, while the performance for the whole data set is shown at the bottom-right. Color-coding is used to indicate the error of each individual ΔΔG estimate. (b) Summary of the performance of the calculations across force fields in terms of RMSE, Pearson correlation, and AUC-ROC (point estimates and the 95% CIs are shown). Differences at three levels of significance are reported using labels within the chart: e.g., a “C36 *” label above the RMSE mark of A99 indicates that the RMSE of A99 is significantly lower (i.e., agreement with experiment is better) than that of C36 at α = 0.10. Marks in solid colors refer to the high-reproducibility subset, while marks in semitransparent colors refer to the full data set.

Table 1

Summary Statistics of the ΔΔG Estimatesa

abbr.	protein FF	ligand FF	#ΔΔG	experimental ΔΔG range (kcal/mol)	RMSE (kcal/mol)	Pearson	AUC-ROC
A99	Amber99sb*-ILDN	GAFF2.1/AM1-BCC	86	6.1	1.07_0.98^1.40	0.37_0.08^0.50	0.57_0.45^0.70
			134	9.5	1.39_1.29^1.72	0.43_0.17^0.54	0.56_0.48^0.67
A99σ	Amber99sb*-ILDN	GAFF2.1/RESP	86	6.1	1.21_1.08^1.54	0.31_0.10^0.44	0.61_0.48^0.72
			134	9.5	1.57_1.43^1.92	0.26_0.05^0.39	0.58_0.47^0.67
A14	Amber14sb	GAFF2.1/AM1-BCC	86	6.1	1.22_1.15^1.60	0.28_0.01^0.42	0.55_0.43^0.68
			134	9.5	1.42_1.35^1.79	0.43_0.20^0.52	0.61_0.50^0.69
C22	Charmm22*	CGenFF 3.0.1	75	4.8	1.09_0.93^1.47	0.32_0.08^0.49	0.70_0.54^0.79
			117	9.1	1.41_1.29^1.76	0.36_0.11^0.50	0.66_0.53^0.73
C36	Charmm36	CGenFF 3.0.1	75	4.8	1.24_1.09^1.58	0.04_–0.22^0.25	0.50_0.36^0.63
			117	9.1	1.57_1.44^1.97	0.27_–0.01^0.43	0.51_0.42^0.63
C36m	Charmm36m	CGenFF 3.0.1	75	4.8	1.23_1.03^1.62	0.22_–0.08^0.42	0.56_0.43^0.69
			117	9.1	1.63_1.46^2.03	0.25_–0.03^0.43	0.52_0.43^0.64
A99σ + C22	Consensus [A99σ + C22]		75	4.8	1.07_0.95^1.39	0.33_0.08^0.47	0.69_0.51^0.77
			117	9.1	1.39_1.26^1.75	0.38_0.11^0.52	0.66_0.52^0.73
A14 + C22	Consensus [A14 + C22]		75	4.8	1.01_0.92^1.40	0.35_0.03^0.52	0.64_0.49^0.75
			117	9.1	1.31_1.22^1.70	0.45_0.16^0.56	0.67_0.54^0.74
C22 + C36	Consensus [C22 + C36]		75	4.8	1.05_0.93^1.37	0.31_0.04^0.46	0.68_0.50^0.77
			117	9.1	1.35_1.25^1.71	0.42_0.16^0.52	0.66_0.53^0.73
ROS	Rosetta (flex_ddg/nov16)		86	6.1	0.99_0.86^1.16	0.39_0.17^0.54	0.61_0.47^0.71
			134	9.5	1.46_1.23^1.73	0.25_0.04^0.43	0.56_0.45^0.65
RMD	Rosetta + MD [ROS and A14 + C22]		75	4.8	0.82_0.74^1.03	0.44_0.11^0.60	0.68_0.51^0.76
			117	9.1	1.23_1.10^1.46	0.49_0.24^0.59	0.67_0.54^0.73

For each set of calculations, the statistics based on both the high-reproducibility and full datasets are shown on the first and second line, respectively. Results obtained with Charmm force fields are based on slightly smaller datasets (#ΔΔG) as glycine mutations were excluded for these force fields (see Methods in the SI). “abbr.”: “FF”: force field; “RMSE”: root mean square error; “AUC-ROC”: area under the receiver operating characteristic curve. Performance of the free energy calculations with different force fields and force field combinations. (a) Scatter plots of experimental versus calculated ΔΔG values. The identity line is shown as a dashed gray line, while the shaded area indicates the region where ΔΔG estimates are within 1.4 kcal/mol of experiment (i.e., within a 10-fold error in Kd change at 300 K). The performance for the high-reproducibility subset of the data is reported at the top-left of each plot, while the performance for the whole data set is shown at the bottom-right. Color-coding is used to indicate the error of each individual ΔΔG estimate. (b) Summary of the performance of the calculations across force fields in terms of RMSE, Pearson correlation, and AUC-ROC (point estimates and the 95% CIs are shown). Differences at three levels of significance are reported using labels within the chart: e.g., a “C36 *” label above the RMSE mark of A99 indicates that the RMSE of A99 is significantly lower (i.e., agreement with experiment is better) than that of C36 at α = 0.10. Marks in solid colors refer to the high-reproducibility subset, while marks in semitransparent colors refer to the full data set. Figure a compares experimental and calculated ΔΔG values for the six force fields tested (abbreviations are explained in Table and Figure b): three from the Amber (A99, A99σ, A14) and three from the Charmm (C22, C36, C36m) family. Figure b shows the performance of these force fields on the high-reproducibility subset of the data (solid color) and on the full data set (semitransparent color), according to the three performance measures described above. The performance is also reported in numerical format in Table . Overall, aside from some exceptions, the different force fields performed similarly, with the lowest RMSE reaching ∼1.1 kcal/mol, but moderate Pearson correlation (up to ∼0.4) and AUC-ROC (up to ∼0.70). RMSEs were between ∼1.4–1.6 kcal/mol for the whole data set, and around ∼1.1–1.2 kcal/mol for the high-reproducibility subset. The Pearson correlations achieved were between 0.22 and 0.43, with the exception of C36, which had a correlation of 0.04 for the reduced data set. C36 and C36m were also the only two force fields that did not achieve a correlation significantly (at α = 0.05) different from zero. In terms of AUC-ROC, the C22 force field stood out as the only one achieving statistical difference from randomness (AUC-ROC of 0.50), with values of 0.66 and 0.70 for the full and the high-reproducibility data set, respectively. On average, the Amber force fields achieved AUC-ROC values slightly below 0.6, while the other Charmm force fields (C36 and C36m) just above 0.5. Note that in these results, the simulations of the apo states were initiated from crystal structures of the protein–ligand complexes. We investigated the effect of starting the simulations from crystal structures of the apo state, where available. However, we did not observe a significant overall improvement (Figure S6). As it has been previously observed that combining results from different force fields in a consensus approach can result in improved performance,[29,42] we explored this simply by averaging the results for pairs of force fields. To consider equivalent computational costs, consensus results were built by averaging ΔΔG estimates obtained with half of the simulations run for each force field (i.e., the first five equilibrium repeats). Then, we compared these consensus results with those of the two parent force fields obtained with the full amount of simulation data (i.e., all 10 equilibrium repeats). Effectively, this exercise answers the following practical question: given the chosen amount of computer time per calculation (in our case, 216 ns), is it more likely to achieve better accuracy by investing all sampling in a single force field or by investing half of the sampling in two separate force fields? Considering all possible combinations of the six force fields considered here, and the three performance measures chosen, we find that it is statistically sensible to use a consensus approach that employs two force fields (Figure S7). In particular, we observed that in ∼44% of the cases the accuracy of the consensus results (in terms of RMSE, Pearson, and AUC-ROC) was better than that of the two parent force fields; in ∼49% of the cases, the performance was in between that of the two parent force fields, and only in ∼7% of cases it was worse than both. When excluding the systems posing convergence challenges, the above percentages further improved to ∼53%, ∼44%, and ∼2%. In Figure and Table we report the data for three of the consensus results (A99σ + C22, A14 + C22, and C22 + C36). The RMSEs for all these three consensus pairs were the same or better than the best RMSE achieved by a single force field (A99), despite not including the data from this force field. In terms of Pearson correlation, the consensus results compare favorably to those obtained with a single force field, with values between 0.31 and 0.45. In particular, all three consensus combinations retained the significant difference over C36, including C22 + C36. The AUC-ROC of the consensus results approaches closely that of C22, being equal or above 0.64 in all cases. Even C22 + C36 shows good performance (0.68/0.66) compared to most single force fields, despite C36 having the worst AUC-ROC among all force fields tested (0.50/0.51).

Performance of Rosetta and Combined Protocols

To compare the results of the free energy calculations to a different approach, and to explore the complementarities of different methods, we tested the performance of Rosetta as it currently is the gold-standard tool for protein design. In total, we tested 11 different combinations of Rosetta protocols and scoring functions (Table S2).[12,34,43,44] However, here we discuss only the results of the best performing protocol (flex_ddg[34]) with one of the latest Rosetta scoring functions (beta_nov16, or here simply “nov16”). This protocol was recently proposed for the prediction of changes in protein–protein affinity upon mutation,[34] and we adapted it to calculate changes in protein–ligand affinity. Despite not being originally intended for this purpose, the protocol performed well and comparably to the more rigorous free energy calculations (Figure a), achieving a RMSE of 0.99 and 1.46 kcal/mol for the high-reproducibility and the full data sets, respectively. For the more reproducible subset of the data, also the Pearson correlation was competitive to MD (0.39), whereas for the full data set it had a lower performance (0.25). The AUC-ROC was slightly inferior to that of the free energy calculations using consensus force-fields (0.61 and 0.56 for the high-reproducibility and the full data sets, respectively), however, not significantly so (Figure c).

Figure 4

Performance of Rosetta protocols. (a) Experimental versus calculated affinity changes for the flex_ddg/nov16 protocol. The identity line is shown as a dashed gray line, while the shaded area indicates the region where ΔΔG estimates are within 1.4 kcal/mol of experiment (i.e., within a 10-fold error in Kd change at 300 K). The performance for the high-reproducibility subset of the data is reported at the top-left of the plot, while the performance for the whole data set is shown at the bottom-right. Color-coding is used to indicate the error of each individual ΔΔG estimate. (b) Experimental versus calculated affinity changes for the consensus results combining the flex_ddg/nov16 results with the free energy calculations results A14 + C22. (c) Summary of the performance of the Rosetta calculations in terms of RMSE, Pearson correlation, and AUC-ROC (point estimate and 95% CIs are shown). Differences at three levels of significance are reported using labels within the chart: e.g., a “A14 + C22 **” label above the RMSE mark of ROS indicates that the RMSE of ROS is significantly lower than that of A14 + C22 at α = 0.05. Marks in solid colors refer to the high-reproducibility subset, while marks in semitransparent colors refer to the full data set. Given that the results obtained with flex_ddg were on the same scale (in terms of energy units) as those obtained with MD, we combined the information from both approaches via simple averaging, similarly to what was done previously for different force fields. While many forms of regression could be used to result in optimal performance on this data set, we focused on the simplest approach to avoid overfitting. The flex_ddg protocol was tested with three scoring functions in total (talaris2014, REF2015, and beta_nov16; see Table S2), and free energy calculations with six force fields. When all possible consensus force field results are considered, there are 63 ways the Rosetta and MD data can be averaged and evaluated with each of the three performance measures employed here (Figure S8). It was found that the performance of these Rosetta + MD consensus results improves upon both parent data in the majority of instances (Figure S8). Specifically, in ∼78% of cases the consensus results were better than both the MD and the Rosetta results, in ∼22% they were in between, while only in ∼0.5% were they worse than both. In Figure b–c we show the consensus results derived from the flex_ddg/nov16 Rosetta protocol and the A14 + C22 free energy calculations as an example (numerical results in Table ). In this case, the RMSE for the high-reproducibility data set drops well below the 1 kcal/mol mark (0.82 kcal/mol), while also the best RMSE for the full data set is obtained (1.23 kcal/mol). Pearson correlation is also the highest among all results for both data sets (0.44 and 0.49). Finally, the AUC-ROC is comparable to that of the best results obtained via free energy calculations (0.68 and 0.67). In addition to MD and Rosetta, we also tested three machine learning algorithms: two trained to predict ligand binding affinity (RF-Score-VS[45] and CSM-Lig(46)) and one specifically trained to predict changes in binding affinity upon protein mutation (mCSM-Lig[47]). Perhaps unsurprisingly, the two algorithms trained to predict affinities performed poorly when applied to the different task of predicting binding affinity changes, as they seemed to be insensitive to a single or few protein mutations. Conversely, the mCSM-Lig algorithm (based on Gaussian process regression) performed well, similarly to the results obtained here by combining the MD and Rosetta data (Figure S9). However, this algorithm was trained on the same data here used for testing (Platinum database[35]), so this is not an independent validation of the approach. In contrast, the physics-based models are trained on simple physical properties such that this data set tests whether they can extrapolate to more complex properties. While machine learning in general, and mCSM-Lig in particular, are certainly promising avenues for the fast estimation of changes in ligand-binding affinity upon protein mutation, other tests on new data sets will be needed to evaluate the performance of such algorithms.

Discussion

In this work we tested multiple approaches for the estimation of ligand binding affinity changes upon protein mutation on a data set of 134 ΔΔG values across 17 proteins and 27 ligands. Given the diverse and challenging nature of the test set, we believe the results shown provide a representative picture of the average performance for the methods tested. Free energy calculations that employed 216 ns per estimate were used to test the accuracy of six different modern force fields. It was shown that combining the results of two force fields in a consensus approach provides better performance than investing all the simulation time in a single force field. In such a way, the free energy calculations achieved RMSEs with respect to experiment as low as ∼1.3 kcal/mol when considering the full data set, and as low as ∼1.0 kcal/mol when excluding four protein systems posing reproducibility challenges. Overall, this performance is in line with what was observed in previous studies.[31,32,48] In particular, Hauser et al.[31] used similar calculations to estimate the effect of Abl kinase mutations on inhibitor affinity in the context of drug resistance: using a data set of 144 ΔΔG values across eight inhibitors, spanning a range of ∼6 kcal/mol, they obtained a RMSE of 1.1 kcal/mol. From these calculations, additional observations that can be useful for the practical application of the methodology can be made. The most efficient free energy protocols invested approximately equal amounts of simulation time in the equilibrium and nonequilibrium parts of the calculations. Random ion placement within the simulation box was found to be detrimental to reproducibility in some instances, as internal water molecules might be replaced by ions, and ions placed in buried protein cavities could bias equilibrium sampling. Despite its simplistic nature, our approach to exclude water molecules directly around the protein to be replaced by ions, and then allow for a short equilibration, was sufficient to solve the reproducibility issue. More sophisticated approaches that place ions in electrostatic potential minima, or a more rigorous treatment of salt conditions,[49] could also obviate the same issue while allowing for starting ion configurations closer to equilibrium. The fact that we did not observe an improvement of the results when initiating the simulations of the apo states from apo-state X-ray crystal structures might suggest that, usually, the structure of the complex provides a reasonable starting structure also for the apo state. While the performance reported is representative of an average across protein–ligand systems, the calculations achieve different performances for specific systems (Figure S10). When focusing on a specific target, it is still recommended to validate the methodology against experimental data whenever possible. The more mutations carried out in a single calculation, the less precise and accurate the estimates were (Figure S11). We did not observe a trend for which the net-charge change of the simulation box correlated with the accuracy of the predictions (Figure S12), supporting the hypothesis that, for this specific application/setup, finite size artifacts due to the use of Ewald summation methods[50] mostly cancel out between the two legs of the thermodynamic cycle such that other sources of error dominate the final errors observed (see also Methods and Figure S13). In this study we also found that a recently proposed Rosetta protocol (flex_ddg) can quantitatively predict changes in ligand affinity upon protein mutation (RMSE of ∼1.0 and ∼1.5 kcal/mol for the reduced and full data sets, respectively). However, we note that there is room for improving the performances of both the free energy and Rosetta calculations. For instance, here we carried out a straightforward adaptation of the flex_ddg protocol that does not sample different ligand conformations; additional adjustments accounting for ligand flexibility might result in higher performance. The efficiency of the nonequilibrium free energy protocol can also be improved by, e.g., considering nonlinear paths for the alchemical transformation or more suitable soft-core potentials.[51] We also note that in this study we used a fairly short equilibration time of 1 ns for all systems, while some may need longer times to equilibrate.[52] In fact, here we studied multiple protein–ligand systems without in-depth knowledge of any of them. When focusing on a specific system of interest, the user is also likely to have prior information that might be used to improve upon the quality and modeling (e.g., on protonation states, or changes in ligand binding poses). Refinement of force field parameters for the ligand molecules could also result in improved accuracy.[53−55] In fact, general advances in the potential energy and scoring functions used,[12,56] and in the quality and speed of sampling,[57−59] can be expected to result in corresponding improvements in the convergence and accuracy of the calculations here described. It is important to put the performances of the calculations in the context of their computational cost. With the free energy calculations, each ΔΔG estimate took between 2 and 5 days on a node with 20 CPU threads and 1 GPU (Intel Xeon E5-2630 v4; GTX 1080 Ti), depending on the size of the system (from ∼30 000 to ∼100 000 atoms). With the flex_ddg protocol, each ΔΔG estimate took up to a day on a single CPU core. It thus emerges that Rosetta and in particular the flex_ddg protocol is likely the most appropriate starting point for a campaign aimed designing ligand-binding proteins or anticipating drug resistance. However, we argue that alchemical free energy calculations are a complementary approach that brings additional value at the optimization stage of the design process. First, the best free-energy-based consensus approaches (e.g., A14 + C22) did return better RMSE, Pearson, and AUC-ROC performance than Rosetta when considering the full data set (Table and Figure c), despite the differences not being significant at the α = 0.10 level. This might suggest that the challenges faced by MD simulations are also present, and possibly to a larger extent, in the Rosetta calculations when considering highly flexible systems. Second, the best performance was obtained only when combining MD and Rosetta results (RMSEs < 1.0 kcal/mol on the high-reproducibility data set; Figure S8). Third, nonequilibrium free energy calculations provide not only a ΔΔG estimate, but also extensive sampling of the four end states of interest (apo and holo simulations for the wild type and mutant proteins). These equilibrium simulations in explicit solvent can be further analyzed and used to either apply additional filters or to rationalize experimental results, as was done for instance by Privett et al.[25] and Kiss et al.[60] Thus, in summary, a possible integrated protocol might include Rosetta calculations for an initial larger screen, followed by refinement of the most promising results via the incorporation of free energy calculations, which would allow for a higher predictive power as well as provide dynamical insight into the effect of the mutations. On the basis of the results obtained here, combining Rosetta and free energy calculation results via simple averaging can achieve RMSEs below 1 kcal/mol when considering protein systems providing well-converged results.

Conclusion

We have shown how both rigorous nonequilibrium free energy calculations based on MD simulations, as well as the latest Rosetta protocols, are able to quantitatively predict changes in ligand binding affinity upon protein mutations. In particular, the best predictions, which were the result of combining the estimate from both methods, closely matched the experimentally determined ΔΔG values, with RMSE of ∼1.2 kcal/mol for the full benchmark set and of ∼0.8 kcal/mol for the set of protein systems showing well-converged results. As these calculations can find direct application for the design of ligand-binding proteins and the prediction of drug resistance, the present results confirm the potential of both physics-based and knowledge-based computational approaches for complementing experimentation in the engineering of biological macromolecules and the design of more robust small molecule drugs.

23 in total

1. Quantitative mapping of protein-peptide affinity landscapes using spectrally encoded beads.

Authors: Jagoree Roy; Björn Harink; Nikhil P Damle; Huy Quoc Nguyen; Naomi R Latorraca; Brian C Baxter; Kara Brower; Scott A Longwell; Tanja Kortemme; Kurt S Thorn; Martha S Cyert; Polly Morrell Fordyce
Journal: Elife Date: 2019-07-08 Impact factor: 8.140

2. Scoring Functions for Protein-Ligand Binding Affinity Prediction using Structure-Based Deep Learning: A Review.

Authors: Rocco Meli; Garrett M Morris; Philip C Biggin
Journal: Front Bioinform Date: 2022-06-17

3. Performance evaluation of molecular docking and free energy calculations protocols using the D3R Grand Challenge 4 dataset.

Authors: Eddy Elisée; Vytautas Gapsys; Nawel Mele; Ludovic Chaput; Edithe Selwa; Bert L de Groot; Bogdan I Iorga
Journal: J Comput Aided Mol Des Date: 2019-11-01 Impact factor: 3.686

4. Large scale relative protein ligand binding affinities using non-equilibrium alchemy.

Authors: Vytautas Gapsys; Laura Pérez-Benito; Matteo Aldeghi; Daniel Seeliger; Herman van Vlijmen; Gary Tresadern; Bert L de Groot
Journal: Chem Sci Date: 2019-12-02 Impact factor: 9.825

5. Protein-ligand binding with the coarse-grained Martini model.

Authors: Paulo C T Souza; Sebastian Thallmair; Paolo Conflitti; Carlos Ramírez-Palacios; Riccardo Alessandri; Stefano Raniolo; Vittorio Limongelli; Siewert J Marrink
Journal: Nat Commun Date: 2020-07-24 Impact factor: 14.919

6. Improvement in predicting drug sensitivity changes associated with protein mutations using a molecular dynamics based alchemical mutation method.

Authors: Fumie Ono; Shuntaro Chiba; Yuta Isaka; Shigeyuki Matsumoto; Biao Ma; Ryohei Katayama; Mitsugu Araki; Yasushi Okuno
Journal: Sci Rep Date: 2020-02-07 Impact factor: 4.379

7. Predicting Kinase Inhibitor Resistance: Physics-Based and Data-Driven Approaches.

Authors: Matteo Aldeghi; Vytautas Gapsys; Bert L de Groot
Journal: ACS Cent Sci Date: 2019-08-13 Impact factor: 14.553

8. The role of NMR in leveraging dynamics and entropy in drug design.

Authors: Abhinav Dubey; Koh Takeuchi; Mikhail Reibarkh; Haribabu Arthanari
Journal: J Biomol NMR Date: 2020-07-27 Impact factor: 2.835

9. Combining free energy calculations with tailored enzyme activity assays to elucidate substrate binding of a phospho-lysine phosphatase.

Authors: Anett Hauser; Songhwan Hwang; Han Sun; Christian P R Hackenberger
Journal: Chem Sci Date: 2020-09-08 Impact factor: 9.825

Review 10. Dynamics, a Powerful Component of Current and Future in Silico Approaches for Protein Design and Engineering.

Authors: Bartłomiej Surpeta; Carlos Eduardo Sequeiros-Borja; Jan Brezovsky
Journal: Int J Mol Sci Date: 2020-04-14 Impact factor: 5.923