Literature DB >> 25691832

KECSA-Movable Type Implicit Solvation Model (KMTISM).

Zheng Zheng¹, Ting Wang, Pengfei Li, Kenneth M Merz.

Abstract

Computation of the solvation free energy for chemical and biological processes has long been of significant interest. The key challenges to effective solvation modeling center on the choice of potential function and configurational sampling. Herein, an energy sampling approach termed the “Movable Type” (MT) method, and a statistical energy function for solvation modeling, “Knowledge-based and Empirical Combined Scoring Algorithm” (KECSA) are developed and utilized to create an implicit solvation model: n class="Chemical">KECSA-Movable Type Implicit Solvation Model (KMTISM) suitable for the study of chemical and biological systems. KMTISM is an implicit solvation model, but the MT method performs energy sampling at the atom pairwise level. For a specific molecular system, the MT method collects energies from prebuilt databases for the requisite atom pairs at all relevant distance ranges, which by its very construction encodes all possible molecular configurations simultaneously. Unlike traditional statistical energy functions, KECSA converts structural statistical information into categorized atom pairwise interaction energies as a function of the radial distance instead of a mean force energy function. Within the implicit solvent model approximation, aqueous solvation free energies are then obtained from the NVT ensemble partition function generated by the MT method. Validation is performed against several subsets selected from the Minnesota Solvation Database v2012. Results are compared with several solvation free energy calculation methods, including a one-to-one comparison against two commonly used classical implicit solvation models: MM-GBSA and MM-PBSA. Comparison against a quantum mechanics based polarizable continuum model is also discussed (Cramer and Truhlar’s Solvation Model 12).

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Solvents
Water

Year: 2015 PMID： 25691832 PMCID： PMC4325602 DOI： 10.1021/ct5007828

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.006

Introduction

Modeling the solvation of molecular species has been of intense interest for many years, in part, because biological processes occur in an aqueous environment. Hence, through the experimental and theoretical understanding of the solvation process in terms of the aqueous structure around a solute and solvation free energies, insights into numerous chemical and biological processes have been obtained. In recent years, many methods have been developed to compute solvation free energies using explicit[1−57] and implicit[58−148] solvent models. Reviewing all the ongoing efforts in this field is beyond the present article, but numerous excellent reviews are available that cover this broad and continually evolving field.[23,66,129] Solvation energy calculations with explicit solvent models are usually carried out with some sort of sampling method, for example, Monte Carlo (MC), molecular dynamics (MD), etc., while implicit solvent models normally treat solvent as a continuum dielectric. Electrostatic energies are then calculated with either classical or QM based methods. In the context of force field based methods, the Poisson–Boltzmann (PB) and Generalized Born (n class="Chemical">GB) models coupled with the hydrophobic solvent accessible surface area (SA) are among the most commonly used classical implicit solvent models.[127−148] They provide reasonable solvation free energies for both small molecules and macromolecules. On the other hand, ab initio computation becomes feasible for solvation energy calculations when solvent is considered as a continuum model; namely, the polarizable continuum models (PCM) with different QM methods,[31,32,35,37,39,46] COSMO,[77] and Cramer and Truhlar’s Solvation Model (SM) series[78−126] are frequently applied in small molecule/ion solvation energy calculations. However, given the central role solvation plays in protein–ligand binding among other biological processes, even modest errors in computed solvation free energies can adversely affect, for example, ligand binding affinity predictions. Hence, there is an ongoing need to improve our ability to model the solvation process using a broad range of methodologies. Herein, we describe an approach that employs statistical potential models coupled with fundamental statistical mechanical methodologies to compute solvation free energies. Besides classical force field models of molecular structure and solvation, statistical potentials have been widely and successfully used to study biological systems.[149−186] Different from force field energy functions, which usually separate the pairwise potential into nonpolar (Lennard-Jones) and electrostatic (Coulomb) potentials, statistical potential methods directly convert observed pairwise probabilities into a pairwise potential by introducing a hypothetical reference state.[149] Chemical environment effects on the pairwise potential are embedded in the number density distributions instead of via electrostatic and Lennard-Jones pairwise interactions. This definition simplifies statistical potentials but increases the error by classifying atoms with significantly different electron densities into the same category.[168,169] Efforts have been made to address this issue by introducing detailed atom type classification schemes or through the use of fragment based statistical potentials.[168,169] Another important issue for statistical potentials is that their performance rely crucially on available structural data. Substantial protein and small molecule structural databases support the success of statistical potentials in many application areas, for example, protein folding[170−186] and protein–ligand binding.[149−169] Structural databases with accurately positioned crystal waters including the Protein Databank (PDB) and the Cambridge Structural Database (CSD) are used herein to build pairwise models of solute–solvent interactions. Statistical potentials, from the theoretical point of view build on the concept of the potential of mean force (PMF), are developed using structural information regarding structurally characterized molecular systems. A mean potential between specific atom pairs (ω(2)(r12)) is directly generated from the frequency of occurrence of the atom pairs contained in a large database of molecules:where g(2) is called a correlation function. β = 1/kBT and kB is the Boltzmann constant and T is the temperature. ρ(r) is the number density for the atom pairs of types i and j observed in the known protein or ligand structures, and ρ(r) is the number density of the corresponding pair in the background or reference state. A central problem for statistical potentials is to model specific atom pairwise interactions removed from the background energy. In protein–ligand complexes, geometric information, that is, atom pairwise radial distributions, represents an averaged effect of all interactions in chemical space, including bond, angle, torsion, and long-range noncovalent forces. Converting these radial distributions into energy functions is a challenge.[169] Computing free energies requires the availability of accurate energy functions but also requires extensive phase space sampling. In the NVT ensemble, the Helmholtz solvation free energy is calculated using the ratio of partition functions given as Equation 2 represents the free energy change of transferring a molecule (L) from vacuum to aqueous solution. Many sampling methods have proven effective, for example, MD and MC methods; however, thoroughly sampling phase space is challenging for brute-force methods. A new sampling method, which we call the Movable Type (MT) method, was developed by our group in an attempt to avoid some of the pitfalls encountered by the more computationally intensive sampling methods. Via sampling of all atom pairwise energies, at all possible distances, using prebuilt databases and then combining these energies for all atom pairs found in the molecular system of interest, the MT sampling method was able to accurately estimate binding free energies as well as protein–ligand poses.[187] In the following paragraphs, we discuss in detail the data selection process, atom type recognition and a novel reference state model that aided in the development of a new solute–solvent statistical potential method, which when combined with the MT sampling method predicts solvation free energies. To validate our model, a curated set of 393 small molecule solvation free energies, collected in the MNSol database of Cramer and Truhlar, was used.[193] Our computed results were then compared to those obtained using the MM-n class="Chemical">GBSA and MM-PBSA models available in AMBER[146−148] against the same data set.

Method

Movable Type Continuum Solvation Model

The Movable Type (MT) method[187] is a free energy method that generates the ensemble of the molecular system of interest using pairwise energies and probabilities. The term “Movable Type” originates from the printing technique where a database of symbols (letters, numerals, etc.) is created and then assembled using a movable type system. Similarly, MT free energy calculations start from the construction of a large database containing interaction energies between all classes of atom pairs found in the chemical space under investigation. An atom pairwise energy function is required to create the database, and the modified KECSA model is employed herein. With an atom pairwise energy database, molecular “printing” is then performed by assembling the pairwise energies using a “printing forme”. A fixed-size matrix (Z-matrix) is introduced to represent the Boltzmann-weighted energy ensemble, in which atom pairwise energies at different distances are assembled to simultaneously represent the ensemble and free energies of the chemical space under investigation. Creation of the Z-matrix starts from the first atom pair in the observed molecular system, with all elements in the matrix as Boltzmann-weighted energies of the observed atom pair at different distances, selected from the energy database. (Z-matrix) in eq 3 represents a Boltzmann-weighted energy (Boltzmann factor) matrix for the kth atom pair in the observed molecule L containing energies ranging from distance r1 to r. A pointwise product of Z-matrices results in the Boltzmann-weighted energy combinations between different atom pairs at different distances with a sampling size of n (matrix size), as is shown in eq 4.where the atom pairwise energy of 1 and 2 are sampled within their respective distance ranges (r and r′). Performing random disordered permutations to the Z-matrices maximizes the variety of energy combinations at different distances. Based on the assumption that the molecular energy is separable into atom pairwise energies in the same molecular system, a pointwise product of n class="Disease">disordered matrices over all atom pairs in the observed molecule derives the final Z-matrix representing the collection of Boltzmann-weighted energies with n configurations for the entire molecule L (eq 5). As the final Z-matrix may contain unreasonable distance combinations between different atom pairs, a Q-matrix of atom pairwise radial distribution probabilities is introduced in order to avoid physically unreasonable combinations between different atom pairs at certain bond lengths, angles, and torsions. The elements in the Q-matrix were collected from a large structural database containing 8256 protein crystal structures from PDBBind v2013[190−192] database and 44766 small molecules from both PDBBind v2013 and the CSD small molecule database. Corresponding to each element in an atom pairwise Z-matrix, there is a distance-dependent probability value chosen from the radial distribution probability database of the same atom pair type. Q-matrices matching the composition of the corresponding Z-matrix are also assembled using pointwise products. The final Q-matrix for the molecule of interest is normalized before being multiplied by the final Z-matrix, assuring that the overall probability is 1. The sum of the final matrix () gives the ensemble average of the Boltzmann factors with a sampling size of n (matrix size). Hence, with a predefined sampling size of n for the Z and Q matrices, the energies of different molecular conformations can be generated simultaneously via matrix products over all atom pairs. The solvation free energy is then calculated by incorporating the ensemble average of the Boltzmann factors intowhere the energy of the molecule in solution (ELS) is modeled asDOFLS and n class="Chemical">DOFL indicate the degrees of freedom of the molecule in solution and in the gas phase, which were assigned the same value in the current implicit water model for simplicity. The free energy is computed directly from the NVT ensemble avoiding issues related to the additivity of the free energy. Theoretically[194,195] and experimentally,[196] it can be shown that the energy can be decomposed, while the entropy and free energies cannot. Herein, we assemble the interaction energies using eq 8 and then place this into eq 7 to directly compute the free energy, thereby, avoiding issues related to the decomposition of the free energy. This is a real advantage of the MT method, and in future work, we will describe using this approach to compute both entropies and energies using statistical mechanics. The MT energy sampling method can incorporate both an explicit and implicit water model into a solvation free energy calculation. Our previous attempt utilized a simple continuum ligand–solvent interaction model.[186] A new semicontinuum n class="Chemical">water model is developed herein, in which the solute–solvent interaction is calculated by placing water molecules around the solute. Water molecules were modeled as isotropic rigid balls with van der Waals radii of 1.6 Å.[197,198] Water molecules were placed into isometric solute-surrounding solvent layers, starting from the solute’s water accessible surface until 8 Å away from the solute’s van der Waals surface with an increment of 0.005 Å per layer. The number of water molecules was limited by comparing their maximum cross-sectional areas with the solvent accessible surface area at each solvent layer for each atom in the solute molecules. The number of water molecules (Nw) accessible to each atom at distance R away from the atomic center of mass is rounded down via filtering using the maximum cross-sectional area (Sw) of water with the atomic solvent accessible surface area (Sa) in the solvent layer at distance R. According to Figure 1, the maximum cross-sectional areas (Sw) of a water molecule is calculated aswhere Rw and Ra are the van der Waals radii for n class="Chemical">water and the atom in the solute molecule, respectively.

Figure 1

Modeling of the implicit solute–solvent model using the movable type method.

Modeling of the implicit solute–solvent model using the movable type method. The Boltzmann factor matrix for the kth solute atom–water () interaction is defined as a Boltzmann weighted solute atom-n class="Chemical">water energy multiplied by the number of accessible water molecules at the different distances. Multiplication of the Z-matrices for all solute atom–water interactions composes the final solute molecule-water Z-matrix (), which when multiplied by the Z-matrix for the intrasolute molecular interactions () derives the final Z-matrix for the solute–solvent complex system (). Multiplication of the final Z-matrix with its corresponding normalized Q-matrix generates the Boltzmann-weighted energy ensemble (). With the energy ensembles for the solute molecule () and solute–solvent complex (), the solvation free energy is calculated using eq 14.

KECSA Energy Function

Data Collection

A collection of structural information is the first requirement to assemble a statistical potential. Many crystal structures in the Cambridge Structural Database (CSD) represent small molecules cocrystallized with water molecules.[188] Moreover, the Protein Data Bank (PDB) also contains a large number of protein–ligand complexes with n class="Chemical">water molecules at the interface between binding pockets and ligand molecules albeit the resolution of these structures are poorer than typically encountered in the CSD. Since our goal was to construct a solvation energy model focusing on small molecules, the CSD small molecule database became our primary resource for structural data. In order to data mine the CSD we only examined structures with (1) an R factor less than 0.1 and (2) all polymer structures and molecules with ions were excluded. The resulting data set selected contained 7085 small molecules surrounded by crystal water molecules.

Atom Type Recognition

Statistical potentials are derived by converting the number density distributions between two atoms or residues to energies; hence, they are “fixed-charge” models for the selected atom or residue pairs. In order to differentiate atoms of the same type but with different electron densities, a detailed atom type categorization has been employed where 21 atom types (shown in Table 1) were chosen from the database as having statistically relevant water molecule contact information contained within the CSD.

Table 1

List of 21 Atom Types in the Current Solvation Model

atom type	description
C1	sp¹ carbon
C2	sp² carbon
C3	sp³ carbon
Car	aromatic carbon
N2	sp² nitrogen
N3	sp³ nitrogen
N4	positively charged nitrogen
Nam	amide nitrogen
Nar	aromatic nitrogen
Npl3	trigonal planar nitrogen
Ow	water oxygen
O2	sp2 oxygen
O3	hydroxyl oxygen
OE	ether and ester sp3 oxygen
Oco2	carboxylate, sulfate, and phosphate oxygen
S2	sp2 sulfur
S3	sp3 sulfur
P3	sp3 phosphorus
F	fluorine
Cl	chlorine
Br	bromine

Energy Function Modeling

The pairwise radial distribution is a mixed consequence of direct pairwise contacts and indirect environmental effects, which is usually described as a “mean force” driven correlation function. For the atom pairwise radial distribution, the mean force potential is generally described by the following formula.where P is the mean force potential on the kth particle in a system, E(r) is the energy between the kth particle and any particle i ∈ {a, b, ···n} in this system at a distance r separating these two particles. Intrinsically, statistical potentials have difficulty in separating out various chemical environment effects on the observed atoms, thereby generating a major source of error in this class of models.[169] As noted by Thomas and Dill,[184] over-represented contacts in a structural database could mask the presence of other contacts. With the reference state presupposing a uniform availability of all contacts, quantitatively minor contacts are always underestimated in statistical potentials. A new statistical potential energy function called n class="Chemical">KECSA developed in our group defines a new reference state attempting to eliminate the contact masking due to quantitative preferences.[189] Unlike traditional statistical potentials using a reference state mimicking the ideal gas state, KECSA defines a reference state energy or background energy as the energy contributed by all atoms surrounding the observed atom pairs. It introduces a reference state number distribution modeled by a linear combination of the number distribution under mean force (n(r)) and the number distribution of an ideal gas state ((n class="Chemical">N/V)4πr2Δr):where x indicates the intensity of the observed atom pairwise interaction in the chemical space V. This definition puts the number distribution of one certain observed atom pair in the reference state somewhere between the ideal gas state and the “mean force” state, depending on its relative strength. Stronger interactions have background energies closer to an ideal gas state while weaker interactions have background energies approaching the mean force state energy contributed by all atoms in the chemical space. To build a KECSA energy function modeling solvent, solute and solvent–solute interactions requires us to define an “x term” for each atom pairwise interaction. Several approaches have been used to model x in our knowledge-based energy function. In the original n class="Chemical">KECSA, we simply used the number ratio of the chosen atom pair i and j over the total atom pairs in the chemical space to represent the intensity x. Meanwhile, based on the assumption that all contacts are uniformly available in the chemical space given by the selected database,[154] we assigned an identical x for every atom pair found in the given chemical space.where N is the total atom type number in the chemical space. The original model of n(r) is based on the notion that every atom pair has an equal contact opportunity in a background energy contributed by the other atom pairs, while neglecting the fact that the background energies have different effects on atom pairwise distributions with different interaction strengths (say atom i and j under a covalent bond constraint compared to atom k and l under a nonbond interaction constraint). A more accurate n(r) model is introduced herein that takes every atom pairwise contact as a energy state distributed between an ideal gas state energy and mean force state energy following a Boltzmann distribution in the reference state. In this way, the x factor is defined aswhere e–β is the Boltzmann factor and N(r) is the degeneracy factor (contact number) for atom type pair i and j. With the x term built up as a probability of all contacts, the number distribution of the observed atom pair in the background state n(r) is modeled as Hence, we can build the energy function for each atom type pair as In eq 20, with the energy functions built up in the chosen chemical space, each E(r) can be derived iteratively at discrete distance points. Using this model, every E(r) derived using the KECSA energy function is never a mean force potential between atom pair i and j as found in traditional statistical potentials.[189] Instead, E(r) represents a pure atom pairwise interaction energy between i and j because the reference state energy defined in n class="Chemical">KECSA is a background energy contributed by all other atom pairs and not just the ideal gas state energy.

Test Set Selection

Two major differences between KMTISM and other continuum solvation models are (1) the MT method calculates the free energy change using a ratio of partition functions in the NVT ensemble, while traditional continuum solvation models separate the Gibbs free energies into linear components with enthalpy and entropy components.[139] (2) Electrostatic interactions are implicit via the categorization of pairwise atom-types in the n class="Chemical">KECSA model while they are calculated explicitly using classical or QM based energy calculation approaches. In this manner, KMTISM can be viewed as the null hypothesis for the addition of explicit electrostatic interactions. If electrostatic interactions are added to a model, it should outperform the knowledge-based approach; if not, the explicit electrostatic model is not an improvement over the implicit inclusion of this key interaction. A key concern for the KMTISM method is the validity of using preconstructed atom-type pairwise energy data in free energy calculations for molecules with similar atoms, which differ in their chemical environments. Hence, KMTISM was examined with test compounds containing C, O, N, S, P, and halogen atoms within different functional groups. Given that the KECSA energy function was parametrized using organic structural data, validation focused on reproducing the aqueous solvation free energy for drug-like molecules. The Minnesota Solvation Database is a well-constructed data set, including aqueous solvation free energies for 391 neutral molecules and 144 ions. This data set was filtered down to 372 neutral molecules and 21 ions in our test set via the exclusion of (1) inorganic molecules, and (2) molecules with atom types not represented in the KECSA potential. This test set, including various hydrocarbons, mono- and polyfunctional molecules with solvation free energies ranging from −85 to 4 kcal/mol, was further classified into different subsets based on the functional groups within the molecules. Some molecules were included in several subsets due to their polyfunctional nature. Carbon, n class="Chemical">nitrogen, and oxygen are essential elements in organic molecules. More than one-half of the compounds in the neutral test set (219 out of 372 compounds) were composed exclusively of C, N, and O atoms. From the Minnesota Solvation Database we created four subsets from these 219 molecules: 41 hydrocarbons, 91 molecules with oxygen based functional groups, 44 molecules with nitrogen based functional groups, and 43 molecules with mixed N and O functional groups. Validation also focused on molecules with sulfur, phosphorus, and halogen atoms, which play important roles in organic molecules. A test set with only halocarbons was created for the purpose of avoiding interference from other polar atoms. Sulfur and phosphorus, on the other hand, are often contained in oxyacid groups in organic molecules. Collected from the neutral data set, a test set with sulfur or phosphorus-containing molecules was composed. Heterocycles, amides, and their analogs are pervasive in drug-like molecules and are well represented in the Minnesota Solvation Database. Thirty-seven heterocyclic compounds and 33 amides and their analogs were categorized into two subsets. In addition, 28 molecules containing three or more different functional groups were selected to provide a challenging test with complex and highly polar molecules. The ion test set was limited to biologically relevant ions herein resulting in positively charged nitrogen and negatively charged carboxylate oxygen subsets. In this way, 21 ions were chosen from Minnesota Solvation Database (11 cations and 10 anions). Alkoxide ions among others present in the Minnesotae Solvation Data set will be examined in the future with the aid of molecular dynamics simulation of ion–water interactions for these ions but were excluded herein. Calculation results using KMTISM, MM-GBSA, and MM-PBSA for all test sets are contained in the Supporting Information. Only statistical data are given in Table 2 and Table 3 in this manuscript.

Table 2

Performance of KMTISM, MM-GBSA, and MM-PBSA for the Prediction of the Solvation Free Energies of Neutral Molecules

	total neutral molecule set			amide set
	KMTISM	MM-GBSA	MM-PBSA	KMTISM	MM-GBSA	MM-PBSA
R²	0.792	0.734	0.804	0.660	0.493	0.509
Kendall’s τ	0.755	0.708	0.793	0.568	0.484	0.465
raw RMSE (kcal/mol)	2.597	4.629	4.647	4.368	8.666	9.717
scaled RMSE (kcal/mol)	2.248	2.634	2.160	3.852	4.885	4.663

Table 3

Performance of KMTISM, MM-GBSA, and MM-PBSA for the Prediction of the Solvation Free Energies of Ions

	ion set
	KMTISM	MM-GBSA	MM-PBSA
R²	0.351	0.000	0.003
Kendall’s τ	0.258	–0.057	–0.067
RMSE (kcal/mol)	5.777	11.736	10.481

Results and Discussion

Comparison with MM-GBSA and MM-PBSA Results

Data analysis covered solvation free energy calculations for all subsets using KMTISM along with the corresponding MM-GBSA and MM-n class="Chemical">PBSA results. Both MM-GBSA and MM-PBSA calculation were performed using AMBER with the General AMBER force field (GAFF). GB parameters were set as igb = 2 and saltcon = 0.100. In the PB calculation, istrng = 0.100. Against the neutral molecule test set, KMTISM and MM-PBSA gave comparable correlation coefficients (R2) and both were better than MM-n class="Chemical">GBSA. According to Kendall’s τ values, MM-PBSA outperformed the other two methods in ranking ability, with KMTISM as the second best. In terms of accuracy of the models, KMTISM has the lowest root-mean-square error (RMSE), while the RMSE values for MM-GBSA and MM-PBSA were almost twice as large. A plot of the experimental data vs calculated data is given in Figure 2 while the statistical results are summarized in Tables 2 and 3.

Figure 2

KMTISM, MM-GBSA, and MM-PBSA calculated vs experimental solvation free energies (kcal/mol) for 372 neutral molecules (kcal/mol).

KMTISM, MM-GBSA, and MM-n class="Chemical">PBSA calculated vs experimental solvation free energies (kcal/mol) for 372 neutral molecules (kcal/mol). For the purpose of a more thorough analysis, a linear scaling model was applied to all three methods using eq 21 in order to bring their respective regression lines closer to y = x. Linear scaling, due to its data set dependence, did not improve the performance of the methods, but instead, it provided a way to examine the deviation of the calculated results from their regression lines. Here, a and b are the slope and the intercept of the regression line between experimental data and computed data, respectively. The MM-GBSA and MM-n class="Chemical">PBSA results suggested a biased regression against the experimental solvation energies (y = 1.3186x – 1.2902 for MM-GBSA and y = 1.5095x – 0.1556 for MM-PBSA, where x and y represent the experimental and calculated solvation free energies). The slopes of their regression lines indicated an overestimation of the solvation free energies using these two methods. The significant improvement in RMSE values for MM-GBSA and MM-PBSA after the linear scaling as well as their correlation coefficient (R2 and Kendall’s τ) values indicate that they have a better ranking ability than free energy prediction. On the other hand, KMTISM’s regression function (y = 1.1078x – 0.0811) affected the RMSE to a lesser extent. Results for different test sets classified by functional groups provide deeper insights into the prediction abilities of the three computational models. For all three approaches, errors increased with the complexity of the functional groups. Solvation free energies of hydrocarbons, n class="Chemical">halocarbons, and oxygen and nitrogen containing molecules were better reproduced than molecules with other functional groups (see Figure 3), while amides and mixed polyfunctional groups resulted in the largest RMSEs (see Figure 4).

Figure 3

KMTISM’s top three performing test sets according to RMSE. KMTISM, MM-GBSA, and MM-PBSA calculated solvation free energies (kcal/mol) vs experimental data are listed from left to right. From the top to bottom, the test sets are the hydrocarbon, oxygen containing, and halocarbon test sets.

Figure 4

KMTISM’s worst three performing test sets according to RMSE. KMTISM, MM-GBSA, and MM-PBSA calculated solvation free energies (kcal/mol) vs experimental data are listed from left to right. From the top to bottom the test sets are the amide, organosulfur and organophosphorus, and polyfunctional test sets.

KMTISM’s top three performing test sets according to RMSE. KMTISM, MM-GBSA, and MM-n class="Chemical">PBSA calculated solvation free energies (kcal/mol) vs experimental data are listed from left to right. From the top to bottom, the test sets are the hydrocarbon, oxygen containing, and halocarbon test sets. KMTISM’s worst three performing test sets according to RMSE. KMTISM, MM-GBSA, and MM-n class="Chemical">PBSA calculated solvation free energies (kcal/mol) vs experimental data are listed from left to right. From the top to bottom the test sets are the amide, organosulfur and organophosphorus, and polyfunctional test sets. Among all data sets, the hydrocarbon set was reproduced with the lowest RMSE value for all of the models, while n class="Chemical">unsaturated hydrocarbons proved more difficult for KMTISM than the other two methods. The drop in R2 for KMTISM was due to overestimation of the solvation free energies of unsaturated hydrocarbons, for example, the two compounds with the largest error (∼2 kcal/mol) were ethene and s-trans-1,3-butadiene, where all heavy atoms were sp2 hybridized. In the KECSA training set, which includes mostly drug-like molecules, different adjacent polar functional groups significantly altered the electron densities of adjacent unsaturated carbon atoms (via delocalization, for example) and this varies the electrostatic characteristics of these carbon atoms more than that seen in the case of sp3 hybridized carbon. On the other hand, polar atom types in the KECSA energy function were classified according to their corresponding hydrophilic functional groups and were less affected by adjacent functional groups. Polar atom type-n class="Chemical">water radial probabilities were driven by a more fine grained atom pairwise set of interactions, thereby, improving the performance of the KECSA energy function for these groups. The oxygenated molecule set and halocarbon set were among the top three test sets based on KMTISM’s performance according to RMSE. Against the oxygen containing molecule set, KMTISM gave a correlation coefficient comparable to MM-PBSA, while its RMSE was better than MM-GBSA. For the halocarbon set, KMTISM outperformed the MM-PB/GBSA methods according to the RMSE and correlation coefficients. Especially for fluorocarbons whose solvation free energies were much better reproduced by KMTISM compared to the MM-PB/GBSA methods. For the data set of 8 fluorocarbons, the RMSE for KMTISM was 1.1 kcal/mol compared to RMSE values as 5.8 kcal/mol for MM-GBSA and 2.2 kcal/mol for MM-PBSA. The variety of atom types in different molecules with different chemical environments make the atom type classification process an inherent source of error for any statistical energy function. The use of atom types in classical potentials is also an issue, but it is typically mitigated by an explicit electrostatic model, which takes into account environmental effects. Collecting similar atom types into the same category can reduce the predictive ability of a statistical potential. For example, the sp3 oxygen atom in n class="Chemical">ethers and peroxides were modeled using the same atom type within KECSA. This resulted in the solvation free energies for the two peroxides to be overestimated by KMTISM, that is, the ΔGsol for methylperoxide was −9.90 kcal/mol or −8.86 kcal/mol (scaled) vs the experimental value of −5.28 kcal/mol and the ΔGsol for ethylperoxide was −10.27 kcal/mol or −9.20 kcal/mol (scaled) vs the experimental value of −5.32 kcal/mol. In comparison with the MM-GB/PBSA methods the ΔGsol for methylperoxide was −9.89 kcal/mol or −6.51 kcal/mol (scaled) using MM-GBSA and −9.07 kcal/mol or −5.90 kcal/mol (scaled) using MM-PBSA; the ΔGsol for ethylperoxide was −9.21 kcal/mol or −6.00 kcal/mol (scaled) using MM-GBSA and −8.59 kcal/mol or −5.59 kcal/mol (scaled) using MM-PBSA. Hence, none of the methods examined particularly did well modeling the solvation free energy of peroxides. As the structural complexity of a molecule increased so did the computed RMSE. Possible long-range polar interactions add additional difficulty to accurate solvation free energy calculations using the methods described herein. The largest errors were found in the amide set, n class="Chemical">organosulfur and organophosphorus set, and polyfunctional molecule set for all three methods. With lower errors for most individual polar functional groups based on the analysis of the monofunctional test set results, KMTISM had less cumulative error against these three test sets when compared with the MM-GB/PBSA methods for both the raw RMSE and scaled RMSE values (see Table 2). This result suggests that KMTISM has an advantage over the MM-GB/PBSA methods for the prediction of the solvation free energy of polyfunctional molecules. This advantage will have a significant effect on the ability of this model to predict, for example, protein–ligand binding affinities, where the solvation free energy of the ligand can have a significant impact on binding affinity prediction. Although the magnitude of the errors in the solvation free energies for the ion test set were relatively poor for all three methods (see Figure 5), KMTISM showed better correlations and RMSE than the other two implicit water models, especially for the charged n class="Chemical">amine test set (see Table 3). While the error magnitude was large for all methods the percentage error is comparable to the neutral set. The carboxylate functional group, which is conjugated, lowered the accuracy of KMTISM’s calculation, while charged amines, on the other hand, whose electron densities are more localized, were better reproduced by the KMTISM method.

Figure 5

Solvation free energy results for KMTISM, MM-GBSA, and MM-PBSA against the ion test sets.

Solvation free energy results for KMTISM, MM-GBSA, and MM-n class="Chemical">PBSA against the ion test sets.

Comparison with SMX Results

Overall, QM based solvation models have had limited application to the study of macromolecular systems due to its higher computational expense, but are routinely used to understand the effect of solvation on small molecules. A thorough analysis of KMTISM against QM based solvation models was not the focus of the present research, but a general comparison helps put the present work in perspective relative to more advanced models. Cramer and Truhlar’s Solvation Model (SMX) series has been developed over several decades and is considered to be one of the best methods available to calculate solvation free energies of small molecules.[78−126] Their most up-to-date model reported mean absolute errors (MAE) for solvation free energy (ΔGsol) prediction ranging from 0.57 to 0.84 kcal/mol using different QM methods against 274 neutral molecules and calculated ΔGsol MAEs ranged from 2.7 to 3.8 kcal/mol against 112 ions.[126] Against a similar small molecule and ion test set, KMTISM reproduced a calculated ΔGsol MAE of 1.8 kcal/mol against 372 neutral molecules and a MAE of 5.1 kcal/mol against 21 ions. Even though the data sets exn class="Chemical">amined were different, the trend is clear that the latest SMX models are more accurate than KMTISM (and MM-GBSA and MM-PBSA) by ∼1 kcal/mol for both the neutral molecules and the ions as measured by MAE. Nonetheless, the performance of our first generation KMTISM model is impressive and future versions of this model should yield even better accuracy.

Conclusion

MM-GBSA and MM-n class="Chemical">PBSA are two broadly used implicit solvation models. KMTISM, using a new sampling method (MT method), combined with a statistical energy function (KECSA), is found to have a comparable or a better ability to predict the solvation free energy for several test sets selected from the Minnesota Solvation Database. Though all of these methods perform worse than the most recent SMX model reported by Cramer and Truhlar. It is important to appreciate that without using the approximation that the free energy of solvation is a collection of linearly combined free energies, as is employed in many traditional continuum solvent models, KMTISM uses computed energies to directly determine free energies. Hence, the Helmholtz free energy is calculated by the construction of the relevant partition functions. These partition functions are efficiently assembled using a new sampling method, the MT method, which is able to rapidly estimate free energy, enthalpy, as well as entropy changes. Drawbacks of the KMTISM model lie in its use of a statistical energy function, whose parametrization can have weak spots for atom types with high polarizabilites and uncommon atom types whose lack of available experimental data can pose issues. Future work includes (1) a detailed study of enthalpy changes and entropy changes using the MT method, (2) improving the statistical energy terms by data collection from MD simulations of atom types with high polarizability and uncommon atom types in structural databases, and (3) replacing the statistical energy function with different force field based energy functions and combine them with the MT sampling method in order to affect the rapid evaluation of thermodynamic quantities.

100 in total

1. Exploring the energy landscape of a beta hairpin in explicit solvent.

Authors: A E García; K Y Sanbonmatsu
Journal: Proteins Date: 2001-02-15

2. Electrostatics of nanosystems: application to microtubules and the ribosome.

Authors: N A Baker; D Sept; S Joseph; M J Holst; J A McCammon
Journal: Proc Natl Acad Sci U S A Date: 2001-08-21 Impact factor: 11.205

3. SMall Molecule Growth 2001 (SMoG2001): an improved knowledge-based scoring function for protein-ligand interactions.

Authors: Alexey V Ishchenko; Eugene I Shakhnovich
Journal: J Med Chem Date: 2002-06-20 Impact factor: 7.446

4. Free energy surfaces of beta-hairpin and alpha-helical peptides generated by replica exchange molecular dynamics with the AGBNP implicit solvent model.

Authors: Anthony K Felts; Yuichi Harano; Emilio Gallicchio; Ronald M Levy
Journal: Proteins Date: 2004-08-01

5. Self-Consistent Reaction Field Model for Aqueous and Nonaqueous Solutions Based on Accurate Polarized Partial Charges.

Authors: Aleksandr V Marenich; Ryan M Olson; Casey P Kelly; Christopher J Cramer; Donald G Truhlar
Journal: J Chem Theory Comput Date: 2007-11 Impact factor: 6.006

6. Prediction of Absolute Solvation Free Energies using Molecular Dynamics Free Energy Perturbation and the OPLS Force Field.

Authors: Devleena Shivakumar; Joshua Williams; Yujie Wu; Wolfgang Damm; John Shelley; Woody Sherman
Journal: J Chem Theory Comput Date: 2010-04-14 Impact factor: 6.006

Review 7. Electrostatic contributions to molecular free energies in solution.

Authors: M Schaefer; H W van Vlijmen; M Karplus
Journal: Adv Protein Chem Date: 1998

8. Ligand conformational and solvation/desolvation free energy in protein-ligand complex formation.

Authors: Michal Kolár; Jindrich Fanfrlík; Pavel Hobza
Journal: J Phys Chem B Date: 2011-04-05 Impact factor: 2.991

9. Binding free energy calculations between bovine β-lactoglobulin and four fatty acids using the MMGBSA method.

Authors: Martiniano Bello
Journal: Biopolymers Date: 2014-10 Impact factor: 2.505

8 in total

1. Overview of the SAMPL6 host-guest binding affinity prediction challenge.

Authors: Andrea Rizzi; Steven Murkli; John N McNeill; Wei Yao; Matthew Sullivan; Michael K Gilson; Michael W Chiu; Lyle Isaacs; Bruce C Gibb; David L Mobley; John D Chodera
Journal: J Comput Aided Mol Des Date: 2018-11-10 Impact factor: 3.686

2. MathDL: mathematical deep learning for D3R Grand Challenge 4.

Authors: Duc Duy Nguyen; Kaifu Gao; Menglun Wang; Guo-Wei Wei
Journal: J Comput Aided Mol Des Date: 2019-11-16 Impact factor: 3.686

3. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges.

Authors: Duc Duy Nguyen; Zixuan Cang; Kedi Wu; Menglun Wang; Yin Cao; Guo-Wei Wei
Journal: J Comput Aided Mol Des Date: 2018-08-16 Impact factor: 3.686

Review 4. Overview of the SAMPL5 host-guest challenge: Are we doing better?

Authors: Jian Yin; Niel M Henriksen; David R Slochower; Michael R Shirts; Michael W Chiu; David L Mobley; Michael K Gilson
Journal: J Comput Aided Mol Des Date: 2016-09-22 Impact factor: 3.686

5. Application of the Movable Type Free Energy Method to the Caspase-Inhibitor BindingAffinity Study.

Authors: Song Xue; Hao Liu; Zheng Zheng
Journal: Int J Mol Sci Date: 2019-09-29 Impact factor: 5.923

6. MovableType Software for Fast Free Energy-Based Virtual Screening: Protocol Development, Deployment, Validation, and Assessment.

Authors: Zheng Zheng; Oleg Y Borbulevych; Hao Liu; Jianpeng Deng; Roger I Martin; Lance M Westerhoff
Journal: J Chem Inf Model Date: 2020-09-11 Impact factor: 4.956

Review 7. Virtual Combinatorial Chemistry and Pharmacological Screening: A Short Guide to Drug Design.

Authors: Beatriz Suay-García; Jose I Bueso-Bordils; Antonio Falcó; Gerardo M Antón-Fos; Pedro A Alemán-López
Journal: Int J Mol Sci Date: 2022-01-30 Impact factor: 5.923

8. On the fly estimation of host-guest binding free energies using the movable type method: participation in the SAMPL5 blind challenge.

Authors: Nupur Bansal; Zheng Zheng; David S Cerutti; Kenneth M Merz
Journal: J Comput Aided Mol Des Date: 2016-10-03 Impact factor: 3.686

8 in total