Literature DB >> 34165298

Alkane/Water Partition Coefficient Calculation Based on the Modified AM1 Method and Internal Hydrogen Bonding Sampling Using COSMO-RS.

Panagiotis C Petris^1,2, Paul Becherer², Johannes G E M Fraaije^1,2.

Abstract

We introduce a physics-based model for calculating partition coefficients of solutes between water and alkanes, using a combination of a semi-empirical method for COSMO charge density calculation and statistical sampling of internal hydrogen bonds (IHBs). We validate the model on the experimental partition data (∼3500 molecules) of small organics, drug-like molecules, and statistical assessment of modeling of proteins and ligand drugs. The model combines two novel algorithms: a bond-correction method for improving the calculation of COSMO charge density from AM1 calculations and a sampling method to deal with IHBs. From a comparison of simulated and experimental partition coefficients, we find a root-mean-square deviation of roughly one log 10 unit. From IHB analysis, we know that IHBs can be present in two states: open (in water) and closed (in apolar solvent). The difference can lead to a shift of as much as two log 10 units per IHB; not taking this effect into account can lead to substantial errors. The method takes a few minutes of calculation time on a single core, per molecule. Although this is still much slower than quantitative structure-activity relationship, it is much faster than molecular simulations and can be readily incorporated into any screening method.

Entities: Chemical Disease Species

Year: 2021 PMID： 34165298 PMCID： PMC8317156 DOI： 10.1021/acs.jcim.0c01478

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

In previous publications, we have discussed automated coarse-graining protocols to calculate partition coefficients[1] and diffusion coefficients.[2] In this work, we focus on providing a fast and reliable method for calculating the alkane/water partition coefficient of lead-like molecules with a molecular weight of up to 800. As far as modeling is concerned, predicting large-molecule partitioning in a practical industrial setting is extraordinarily difficult. One must be able to accurately calculate the COSMO charge density distribution of molecules with a molecular weight of up to 800 and then the free energy of partition (here by COSMO-RS[3]) in a reasonable amount of time (a few core minutes). Finally, internal hydrogen bonding (IHB) redistribution should be considered, taking proper account of statistical thermodynamics. In simple terms, when the molecule is dissolved in water, it is energetically favorable to expose the IHB to the solvent, while in an organic solvent, the molecule forms IHB to inhibit its interaction with the surrounding apolar environment. In a recent statistical assessment of modeling of proteins and ligand (SAMPL5) competition to predict cyclohexane–water partition coefficients of lead-like molecules,[4−6] it was found that COSMO-RS beats the competition, including force-field methods, in both calculation time and accuracy. From a medicinal chemist point of view, one has only a few choices. The increasing size of drugs/drug candidates and the need for fast estimation in medicinal chemistry screening require significant advancements in ab initio(7) and force-field-based[8,9] methods to compete against COSMO-RS,[4] UNIFAC,[10] and quantitative structure–activity relationship (QSAR) approaches.[11] QSAR approaches are the ones requiring the least computational resources, although they are not always precise, and most importantly, they do not provide physics-based insights to improve on them. A statistical thermodynamic model like COSMO-RS or UNIFAC can provide accurate results in relatively short time but cannot really be used in any system without careful consideration. More specifically, charged molecules cannot really be handled using COSMO-RS in its current implementation. Certain functional groups are not described adequately enough, regardless of the method used to extract COSMO charge densities. Theoretically, these issues are resolved by using ab initio and force-field-based models. Such approaches will always be the most accurate ones but very expensive computationally. Depending on the system, the properties, and the application of interest, one should decide which method to use cautiously, as this is the most important choice in building a successful workflow. Despite their shortcomings, COSMO-RS and UNIFAC that provide accurate estimations considerably faster than force-field-based methods are increasingly being used in the literature. In other domains, such as octanol–water partition coefficients, one could perhaps rely on massive amounts of data to build an ultra-fast machine learning correlation or a QSAR model. However, for the systems of current interest, drug-like molecules partitioning in water and alkane solvents, there are simply not enough data to calibrate correlations. Hence, a statistical approach cannot be adopted, and models must necessarily be physics-based. A suitable candidate can be found in the COSMO-RS model because it requires only a handful of parameters, all with realistic physical meaning, while it is, at the same time, precise. Unfortunately, in standard COSMO-RS, one needs to calculate the charge envelope by performing a costly density functional theory (DFT) calculation, which could take many days for the molecular targets of our interest. Approximate methods exist that combine fragments extracted from a library (COSMOfrag[12,13]), but the library and extraction method are not publicly available nor described in sufficient detail to rebuild these methods. The second significant problem is that many lead-like molecules exhibit IHBs as these amplify drug-design specificity. From analyzing small molecules such as salicylic acid and o-nitrophenol, it is clear that IHBs can have a dramatic effect on the partition coefficient. The difference between open (in water) and closed (in hydrocarbon solvent) state can account for roughly one to two log 10 unit per IHB.[14−16] When extended to large molecules (i.e., MW 800) that can easily have 3–5, if not more, IHBs, the effect is dramatic, and getting this right will make or break any prediction tool. The present work is structured specifically to resolve the two issues mentioned above (fast estimation of COSMO charge density and the IHB effect). In the Systems and Methods section, we describe the methodology that we have developed to predict COSMO charge density in a fast and accurate way, using the AM1 semi-empirical method[17−20] alongside a bond charge correction.[21,22] To cope with the IHB effect, a separate algorithm was developed to generate the conformations that include all the possible open (no IHB formed) and closed (IHB formed) states for a given molecule. We also discuss the way to extract the partition coefficient and the data sets used. In the Results section, we present the COSMO charge density as calculated from our newly developed method, AM1-bond charge correction-COSMO (ABC), and we compare it with the corresponding DFT outcome. We optimize COSMO-SAC coefficients against the thermodynamic and partitioning data. We apply the ABC model with the IHB algorithm and the optimized COSMO-SAC coefficients to the BioByte and SAMPL5 data set. In the Discussion section, we report a few improvements that could make the model more accurate. The Supporting Information contains the BCC parameters for ABC, the molecules extracted from the BioByte and SAMPL5 data set, and the Kuhn patterns used to identify IHB. All algorithms, including AM1, are implemented in CULGI software version 13.0,[23] installed on a Dell desktop PC equipped with an Intel Xeon E5-2670 dual processor.

Systems and Methods

This section is separated into four different subsections explaining the methods we have developed and the systems studied. We present the “AM1-bond-charge-correction-COSMO” (AM1-BCC-COSMO, ABC for short) algorithm we have developed for calculating the COSMO charge density of a molecule. We include a detailed explanation of the IHB algorithm we developed to generate all possible IHB conformers. Subsequently, we illustrate the methodology for parameterizing COSMO-SAC coefficients and the features and characteristics of the data sets we applied the protocol to (CULGI,[1] NIST standard reference database 103b as implemented in the NIST ThermoData Engine (TDE) version 7.0, BioByte,[24] and SAMPL5[4,25,26]). A schematic representation of the implemented protocol can be found in the form of a flowchart in Figure .

Figure 1

Protocol’s flowchart is used to optimize the ABC model, generate IHB conformers, parameterize the COSMO-SAC coefficients, and predict the SAMPL5 distribution coefficients (denoted by a dashed rectangle).

ABC Algorithm

We introduce a COSMO charge density calculation method based on Dewar’s AM1,[17−20] a widely used and fast semi-empirical quantum method. Our implementation of COSMO in AM1 is largely based on the method described by Klamt and Schüürmann,[24] specifically solving the linear system from eq 9 in that paper directly. However, for calculating the interaction of the electron density (orbitals) with the COSMO surface charges, we follow the implementation by Klamt in MOPAC 7.[27−29] There, the electron density is approximated by a collection of atom-centered multipoles (monopoles, dipoles, linear quadrupoles, and square quadrupoles) in the same way as this is used for calculating the two-center repulsion integrals in MNDO,[30,31] using the same parameters as MNDO. An additional approximation is then made by treating the multipoles as point dipoles instead of physical dipoles (taking the limit to zero distance between charges while keeping the multipole moments constant). This leads to relatively simple analytical expressions for the electrostatic potential arising from the electron density. The explicit expressions can be found in the Supporting Information. Additionally, our implementation of COSMO in AM1 generates smoother COSMO surfaces than MOPAC. The construction of the solvent-excluded surface is based substantially on the work of Sanner et al.[32] AM1 by itself does not give accurate electrostatic moments and as a result gives unsatisfying COSMO charge density,[33] which we propose to repair by a QSAR type of approach. Our new ABC algorithm is a bond charge correction that works similar to the BCC method for force fields.[21,22] Compared to the force-field BCC algorithm, we do not optimize for atomic partial charges but for COSMO surface charge density. The BCC parameters are calculated by comparing DFT to AM1 results; no experimental input is needed, but instead, we can profit from matching vast amounts of purely computer-generated data. The algorithm itself can be described as follows: Perform an AM1 calculation with COSMO as usual, giving the COSMO surface charge density σMAM1 of the molecule. Apply the AM1-BCC method (with modified parameters to be determined) to obtain atom-centered correction charges qMBCC Calculate the COSMO surface charge density that results from these correction charges alone, ΔσMBCC by solving separately the COSMO equations using only the correction charges. Because the COSMO equations are linear, this COSMO correction charge density is added to the charge density from the original AM1-COSMO calculation to obtain an improved COSMO surface charge density σMAM1-BCC = σMAM1 + ΔσMBCC. In this way, the improvement of the COSMO calculation is a straightforward post-processing step that can be performed after a regular AM1-COSMO calculation. The improved COSMO sigma profile can then be used in COSMO-RS-type calculations, after re-optimizing the COSMO-RS parameters for this source of sigma profiles. The modified AM1-BCC parameters for this procedure are found by fitting the parameters to minimize the difference between the AM1-BCC-COSMO surface charge density σMAM1-BCC = σMAM1 + ΔσMBCC and the σMDFT resulting from a full DFT COSMO calculation. This is in contrast with the force-field BCC method, where the parameters are fitted to reproduce atomic partial charges. For the optimization, we must construct an objective function to be minimized, which expresses the deviation of the AM1-BCC-COSMO results from the full DFT calculations. The surface charge density is calculated on a discretized (triangulated) surface, leading to a representation as a collection of local surface charge densities σM,I,, one for each surface segment i in the molecule M. In the objective function, we multiply each σ with the surface area a of the segment to properly weigh the size of the surface segments. We, then, minimize numerically the following objective function Minimizing this function (the total square deviation) is equivalent to minimizing the root-mean-square deviation. In order to calculate the difference between the surface charge density distributions segment by segment, we perform a COSMO calculation on the DFT BP86/def2-TZVP[34−39]-optimized geometry for both the full DFT calculation and the AM1-BCC-COSMO calculation, using the same COSMO surface tessellation for both cases. The set of molecules that we used for this parameterization is a subset of an in-house database of over 11,000 molecules (“the Culgi molecular database”) that all have pre-computed COSMO information, calculated with DFT as described above. To obtain a manageable subset of this database while making sure that we have enough occurrences of each BCC bond type, we followed this procedure: For each bond type in the list of BCC bond types, we determined the molecules that contain this type. If there were fewer than 10 molecules containing a bond type, we did not include that bond type in our list. This means, for example, that we cannot apply our AM1-BCC-COSMO on iodine-containing molecules because we only have a handful of these in our entire database, not enough for a meaningful parameterization of the iodine-containing bond types. The more than 300 bond types that were present in the original AM1-BCC scheme were thus reduced to 149 types. Of these 149 types, there are 17 that are bonds between atoms of identical type, where the correction charge is necessarily 0. That leaves 132 non-trivial bond types. These parameters suffice for the vast majority of molecules in the Culgi Molecules Database. For each bond type that has 10 or more molecules containing it, 10 molecules with the lowest molecular weight were selected. The overlap between these sets of 10 molecules left us with 965 molecules to parameterize 132 non-trivial bond charge-correction parameters. The objective function was minimized using an L-BFGS-B minimizer as implemented in the SciPy library for the Python programming language. The resulting set of bond charge corrections can be found in the Supporting Information. A more quantitative analysis of the results of applying this to the entire Culgi Molecules Database is presented in the Results section. However, an indicative demonstration of how well the new model works is provided in Figure . We show the COSMO charge density distribution of propanol calculated from AM1, ABC, and DFT, respectively. It is evident that ABC (red line) is far closer to DFT results (blue line) than plain AM1 (black line), even for a molecule without special and exotic functional groups, such as propanol.

Figure 2

COSMO charge density distribution of propanol calculated using (a) AM1 semiempirical method, (b) ABC model, and (c) DFT with the modified NWChem software package.[40]

COSMO charge density distribution of propanol calculated using (a) AM1 semiempirical method, (b) ABC model, and (c) DFT with the modified NWChem software package.[40] DFT calculations were carried out with a modified version of NWChem 6.3,[40] with an improved COSMO mesh generation based substantially on the work of Sanner et al.,[32] and with outlying charge correction.[41] Molecules were geometry-optimized in vacuum, using the def2-TZVP basis set[34,37] and BP86 exchange–correlation functional.[36−38] A COSMO calculation was then performed on that geometry using the def2-TZVPD basis set.[34,35,39] These are the recommended high-quality settings for COSMO-RS as used in COSMOtherm software[42] and also used in the entry of Klamt et al. in the SAMPL5 challenge.

COSMO-RS Partition Coefficients

The ABC algorithm is completed by reparameterization of some of the COSMO-SAC coefficients by fitting four sets of thermodynamic and partitioning data. Our implementation is based on COSMO-SAC,[43,44] but because COSMO-SAC and COSMO-RS differ only cosmetically, we will use the name COSMO-RS throughout the paper. With the ABC method, we can calculate the COSMO charge density of even a large molecule (MW 800, the upper range of SAMPL5 molecules) in just a few minutes on a single core. Afterward, one can easily derive the partition coefficient in less than a second. The partition coefficient is defined as the ratio of concentrations in the two solvents as followswhere Vwater and Voil are the molar volume of water and oil, respectively, while the difference in chemical potentials Δμ0 is calculated from COSMO-RS through the activity coefficients γ, viawhere a is the surface area and σ is the COSMO surface charge density on segment i. When two or more conformations of the same molecule are included in the partition coefficient calculation, the corresponding relation is transformed. For two conformations A and B, in oil/water solvents, the cumulative partition coefficient is provided by the following equationwhere Eoil(A) and Ewater(A) are the total energy required for the conformation A to move from the perfect conductor state into a solvent like oil or water, respectively (same for conformation B). Such a quantity for a conformer i in a solvent S can be calculated aswhere ECOSMO(i) is the total quantum mechanical energy of the conformer i in the perfect conductor and μS(i) is the chemical potential of the conformer i in solvent S that accounts for the transition of this conformer from a perfect conductor into the solvent S. COSMO-RS theory amounts to the integration over the surface, of a local free energy difference for a charged segment exposed to two different solvents Δf = foil – fwater. The local free energy function differs between various COSMO-RS flavors; in most cases, as we do here, one accommodates the polarization energy of a charged segment in a conductor and a hydrogen bonding term. In the limited sense of integration over the surface, the model is not too dissimilar from QSAR, where one integrates over different types of polar groups and hydrogen donors and acceptors. The essential difference is that in COSMO-RS, one does not employ a group description and group parameters but a generic interaction model based on electrostatics from quantum chemistry calculations, and therefore, one needs only a few physics-based parameters, instead of a whole zoo of groups and group parameters. In this paper, we use the “SplitOH” COSMO-SAC model from Lin’s group[45,46] (a recent approach by Chang et al.[47] suggests directional interactions for a more accurate representation of hydrogen bonding). The Staverman–Guggenheim combinatorial term is used. Instead of taking the published Lin parameters as is, we re-optimized the COSMO-SAC coefficients. This was carried out by minimizing the total root-mean-square error for four sets of thermodynamic and partitioning data: 400 pressure data points for binary vapor–liquid equilibria (VLEs) 400 excess Gibbs free energies for 50/50 liquid binary mixtures (derived from VLE data) 400 partition coefficients for octanol/water 212 partition coefficients for hexadecane/water VLE data were obtained from the NIST standard reference database 103b as implemented in the NIST TDE version 7.0. Partition coefficients were obtained from the BioByte data set.[24] More information related to the parameterization scheme can be found in the Supporting Information. The optimized settings can be seen in the following table where cbase defines the interaction between any related segment, cOH–OH provides the (additional) interaction between segments with hydroxyl groups, cOH–OT gives the (additional) interaction between a segment containing a hydroxyl group and a segment with a different hydrogen bonding acceptor, and cOT–OT represents the interaction between two segments that do not contain any hydroxyl group. The segment surface area is the size of a thermodynamically independent segment. This is equivalent to the inverse of the so-called “coordination number” in chemical engineering models. One can use the suggested coefficients in Table alongside a software package supporting COSMO-SAC calculations. An open source implementation of COSMO-SAC has recently become available.[48]

Table 1

Optimized COSMO-SAC Coefficients

c_base (kcal/mol)·(Å/e²)	c_OH–OH (kcal/mol)·(Å/e²)	c_OH–OT (kcal/mol)·(Å/e²)	c_OT–OT (kcal/mol)·(Å/e²)	segment surface area (Å²)
6032.98	3175.29	2209.34	940.67	7.58

IHB Sampling

The special treatment of IHB in a molecule is related to the different conformations that the molecule can form when dissolved in a polar or apolar solvent. More specifically, if a molecule that exhibits an IHB is dissolved in a polar solvent, the acceptor (hydroxyl, amine, etc.) and the donor would be exposed to the solvent, the so-called “open” conformation. If dissolved in an apolar solvent (i.e., alkane), the molecule tends to form IHB so that the acceptor and donor are hidden from the solvent; this is the “closed” conformation. Therefore, we have conducted conformation analysis for molecules exhibiting IHB, so that all possible combinations of open and closed IHB structures will be included in the final analysis. Such analysis ensures that the effect of the solvent on the molecular conformation is explored and accounted for. Following the work of Kuhn et al.[49] we have used the Kuhn rings to identify the presence of IHB in a molecule. The patterns do not include the trans-core IHBs that can be found in macrocycles, and we therefore omitted all such molecules from the present study. We have included an additional pattern for the IHBs between carboxylic groups and aromatic nitrogen. Finally, a few additional rings are included to make sure that we capture the IHBs present in smaller molecules like salicylic acid and o-nitrophenol. The second challenge in IHB sampling is to find an automatic way of reproducing the conformations that would include all the combinations of open and closed IHBs. We start by assigning zero charges to all atoms, apart from the donor and acceptor atoms that take part in the formation of the IHB. To the donor/acceptor couple, we assign opposite charges if the closed conformation should be formed (attractive electrostatic interaction) or equal charges if the open conformation should be formed (repulsive electrostatic interaction). Using these charges, we run successive molecular dynamics runs followed by energy minimization, until a hydrogen bond distance criterion is met (distance below a certain value if an IHB is to be formed and distance above a certain value if the atoms should not form an IHB). Each molecular dynamics run consists of 1000 steps. We have selected an NVT ensemble with the Andersen thermostat, at a temperature of 298 K, with a 0.5 fs timestep coupled with the Dreiding force field. As the second step, we re-assign the correct charges and geometrically optimize the conformation, so that we make sure that the structure is energetically favorable. This way, after the optimization step, some of the conformations might end up without the desirable IHB, although the vast majority of them will maintain the desired structure. Then, we apply AM1 geometry optimization to each conformation to ensure that we have a physically meaningful representation of the molecule. The COSMO charge density is then calculated using ABC for every conformation. Finally, we calculate the partitioning of the molecule between water and alkane using eq . A characteristic example can be found in salicylic acid. Figure illustrates the two different conformations that should be taken into account when calculating the partition coefficient of this molecule in water/alkane solvents. When the molecule is dissolved in water, the “open” conformation, depicted in Figure a, is practically the only one observed. In contrast, when dissolved in alkane solution, the “closed” conformation, represented in Figure b where an IHB is formed, is predominant. The COSMO surface is visualized in the form of a color map. The red surface areas correspond to a positive COSMO charge density, while the blue areas carry a negative COSMO charge density. The gray COSMO surface indicates neutral COSMO charge density.

Figure 3

Salicylic acid’s (a) open conformation, favorable in an aqueous environment, and (b) closed conformation, favorable in alkane solution. The COSMO surface is depicted in the form of a color map.

Salicylic acid’s (a) open conformation, favorable in an aqueous environment, and (b) closed conformation, favorable in alkane solution. The COSMO surface is depicted in the form of a color map. The severe effect of leaving out the closed conformation is evident when estimating the salicylic acid partition coefficient in water/cyclohexane. Using eq , the open conformation partition coefficient is equal to −5.43, while the experimental one is −1.97. If we use eq to include both conformations, we get to the value of −2.54, which is in good agreement with the experiment.

Data sets

We used the BioByte database[24] as the primary source for partition coefficients between water and alkane solvents (hexane, cyclohexane, heptane, isooctane, octane, decane, undecane, dodecane, and hexadecane). We have curated the data set manually: for the solutes with data in more than one solvent, we only accepted those entries where log 10 deviated less than 1.5 unit between the lowest and highest. We accepted all entries for which only one solvent is present, for lack of a good curation criterion. We have also omitted all tautomer-forming molecules (as they need to be separately investigated for different solvents, while we have not developed a special model for their treatment), all molecules with more than five rotatable bonds (because we focus mostly on the residual contribution, so we limit surface flexibility and folding effects), and all crown ethers, where water binding should be taken into account due to the extreme hydrogen bonding behavior. The OpenBabel package is utilized to generate the molecular three-dimensional conformations from their corresponding SMILES strings. Subsequently, we stretch the molecule open to prevent energetically unfavorable starting conformations; this also avoids unphysical steric and intramolecular effects. To do so, we apply molecular dynamics without charges to open the molecule, followed by energy minimization with the corresponding force-field charges to get a molecule with realistic geometry. When the same molecule is encountered again for different solvents, we keep the generated conformation from the earlier occurrences to ensure that statistical mechanics from the molecular dynamics steps will not sample different conformations for different alkane solutes, hence providing different results. As discussed in the previous section, a further distinction is made between the molecules with IHB (a few hundred) and those without IHB (a few thousand). The molecules are listed in the Supporting Information. A second set is the SAMPL5 data set of drugs and drug-like candidates.[4,25,26] In Figure , we present the molecular weight distribution for the BioByte and SAMPL5 data set. The average of the molecular weight distribution for the BioByte data set is equal to 180 and for SAMPL5 is equal to 290. This difference is expected because BioByte contains many smaller solutes, while SAMPL5 includes larger lead-like molecules.

Figure 4

Molecular weight distribution for BioByte and SAMPL5 data sets.

Results

In this section, we demonstrate how we acquired the correction charges for the ABC model for some of the most representative functional groups. Subsequently, we have applied the protocol (ABC alongside IHB sampling) to predict the partition and distribution coefficients in the BioByte and SAMPL5 data sets, respectively.

ABC Optimization

As explained in the Systems and Methods section, we optimized the ABC model against DFT results for a wide range of functional groups found in the CULGI database (extensively analyzed in previous publications[1]), such that the surface charge density disparity between ABC and DFT is minimized. The quadratic error in the surface charge density between AM1 and DFT and between ABC and DFT (for a few of the most frequently appearing functional groups) is represented using boxplots in Figure .

Figure 5

Quadratic surface charge density error between DFT and (a) AM1 and (b) ABC for a selection of different functional groups represented in a boxplot fashion.

Quadratic surface charge density error between DFT and (a) AM1 and (b) ABC for a selection of different functional groups represented in a boxplot fashion. By comparing either the median value or the relative position of a boxplot between AM1 and DFT (Figure a) and between ABC and DFT (Figure b), one can see that for all functional groups, apart from sulfides, there is a moderate to substantial improvement using ABC instead of AM1. We can see the molecules with hypervalent sulfur and phosphorus in the upper regions of Figure a: “sulfonamide” and “phosphate”—the latter including thiophosphates. In Figure b, these groups improve significantly, with errors ending up in the same range as more “regular” molecules. It is known that AM1 does not deal well with these atom types, and this is also clear from looking at the COSMO surface of these molecules when comparing the DFT COSMO surface charge density with the AM1 COSMO surface charge density, as we show graphically in the Supporting Information for the pyridine-3-sulfonamide molecule. The nitriles show a moderately large improvement. In general, for large molecules, the dominant source of error is often formed by large alkyl tails. The values of the optimized parameters also bear this out. For the bonds involving hypervalent S and P, the charge shift is often as large as 0.4 to even 0.9 electron charges. For nitriles and similar groups, this may still be up to 0.1 to 0.2, but in the vast majority of other bonds, the charge displacement is below that value. To quantify our metric in familiar COSMO terms, if there is a difference in σerror of one unit between the estimation of AM1 and ABC, this corresponds to approximately 0.85 kBT difference in the total surface segment energy. Because the introduced bond charge correction significantly improves AM1 results, we are justified to apply it in calculating the COSMO charge density and, subsequently, the partition coefficient.

Protocol Application to the BioByte Data Set

We have applied the ABC model to the BioByte data set to predict the alkane/water partition coefficient. For molecules that exhibit IHBs, we applied the IHB sampling algorithm as described in the Systems and Methods section. For molecules containing carboxyl groups, we use only the syn conformation (with the hydrogen pointing in roughly the same direction as the carbonyl oxygen) and not the anti-conformation (with the hydrogen pointing away) because the former is the most stable and energetically favorable one as far too polar behavior is observed when we sample energetically unfavorable carboxyl anti-conformations (donor and acceptor exposed to the solvent). The necessity of special treatment for carboxyl groups has already been demonstrated in force-field-based works in the literature.[50,51] Figure compares the experimental Kexp and calculated Kcalc partition coefficients for each hydrocarbon solvent (namely, isooctane, decane, hexane, heptane, octane, undecane, dodecane, hexadecane, and cyclohexane). A color density map is used for regions with overlapping data points. Prior to optimization, using the SplitOH coefficients as suggested by Lin’s group,[45,46] we obtained RMSD = 1.92 with a correlation coefficient of 0.80. Using the optimized COSMO-SAC coefficients (shown in Table ), we calculated RMSD = 1.24 with a correlation coefficient of 0.81. This is a significant improvement, considering how diverse the data set is.

Figure 6

Computed partition coefficient, from ABC and IHB sampling, plotted against the experimental partition coefficient of the BioByte data set for different solvents. The red line corresponds to y = x. A color density map is used for regions with overlapping data points. Looking in more detail, taking each solvent separately into account, we notice that the model appears better for some solvents and not that adequate for others. For instance, heptane exhibits RMSD = 1.55, while hexadecane shows RMSD = 0.79. Heptane has more data points, but it is also the noisiest of the two. This noise does not appear to be a feature of the solvent; we have verified this for the same molecule, partitioning is almost identical for both solvents. This is to be expected because for all alkanes, the COSMO profile is essentially a delta-function, that is, all these hydrocarbon molecules have an almost zero COSMO charge density everywhere on the entire molecular surface. However, we have identified several molecules of rather exotic structures and functional groups being experimentally studied only in the case of heptane (like morin and xanthine). ABC is less accurate in predicting the partitioning of such molecules, which are more common in the heptane data set, making this particular solvent appear to be much more problematic and prone to error. This is an indication of how important it would be to improve upon functional groups that are not described with high accuracy in the current model. The scatter plots of hexadecane and heptane show a downward curvature for larger values of the partition coefficient, which could be attributed to a combination of conformational changes and insufficient alkyl–water parameterization. In the current ABC implementation, all conformations are obtained by AM1 minimization; no sampling is applied (apart from cases where IHBs occur and we apply a special sampling as explained in the IHB sampling section). We have checked that almost all compounds with large partition coefficients are hydrocarbons. Overall, the fit is entirely satisfactory. The attained accuracy is typical for what one can expect for oil–water partition coefficients from COSMO-RS. Still, here we have achieved this result by applying semi-empirical calculations (a few minutes per molecule) rather than full and much more expensive quantum DFT calculations (hours per molecule).[4]

Protocol Application to the SAMPL5 Data Set

We can, also, apply the developed protocol in predicting the distribution coefficient for water/cyclohexane of the SAMPL5 data set containing drug-like molecules. The distribution coefficient is given by the following relationwhere K is the partition coefficient of the neutral molecule and α is the neutral fraction of molecules at pH = 7.4; pK values are estimated with the program cxcalc from ChemAxon. As we observe in Figure our model gives RMSD = 1.88, slightly outperforming Klamt’s result of RMSD = 2.07, while our model correlation coefficient is equal to 0.81, comparable to Klamt’s 0.85. The overall comparison indicates that we can obtain similar results, if not improved, using ABC to calculate COSMO charge density and IHB sampling in the case of IHB. At this point, we need to stress that on comparing the RMSD with respect to the experimental data and uncertainty in measurements, the difference between our model and Klamt’s COSMO-RS is insignificant. Therefore, in the SAMPL5 challenge, our protocol would have performed similar to the winner.

Figure 7

Distribution coefficient calculated from ABC and IHB sampling, alongside the results from Klamt et al.[4] against the experimental data of the SAMPL5 challenge. The red line corresponds to y = x.

Distribution coefficient calculated from ABC and IHB sampling, alongside the results from Klamt et al.[4] against the experimental data of the SAMPL5 challenge. The red line corresponds to y = x. What deserves further investigation is that both ABC and Klamt’s model deviate from the experimental values in two ways: for polar molecules, partition coefficients are too low (log too negative), and for less polar molecules, the partition coefficients are systematically too high (log too positive). In fact, it seems as if the entire curve is shifted a few units in the positive direction. These two effects have been discussed in extenso by Klamt[4] and the SAMPL5 organizers,[5] without any clear conclusion. Other prediction methods of very different backgrounds (force fields) seem to suffer from this same phenomenon, so it could be concluded that this is partly an experimental effect. It could also be due to limited conformational sampling. Apart from molecules exhibiting IHBs, we did not use any conformational sampling. We simply took the conformation of the lowest AM1 energy in a vacuum. In contrast, force-field methods do sample conformations while Klamt also used conformational sampling. Considering that ABC performs similar to or better than all of them at a fraction of the computation time, the sampling issue remains a puzzle.

Discussion

To our knowledge, there is no similar large-scale comparison between the calculated and experimental partition coefficients. As far as we know, in works using molecular dynamics, only a handful or at the most a few tens (SAMPL5) of molecules are investigated.[52,53] These data sets are too small to draw statistically valid conclusions, and anyway, a molecular dynamics simulation on the scale of the BioByte database is not possible in a reasonable amount of time. On the QSAR level, we also found only a few studies: a study on ∼100 molecules using Abraham descriptors (a fine RMSD 0.2–0.3 but a rather poor correlation r2 0.5–0.7)[54] and an older QSAR study on ∼100 molecules by the group of Leo (the source of some of the experimental data we use here).[55] These studies have been limited to a few tens or a few hundred molecules at most, and apart from this, it is always challenging to extract physical insights from QSAR modeling. For the ubiquitous log P (octanol/water) predictions (COSMO-RS was recently used to predict octanol/water partitioning in small drug molecules[56]), the situation is very different and large-scale comparisons abound (see, e.g., the review[57] that compares methods based on a data set of more than 96,000 molecules). For hydrocarbon predictions, the situation is much less satisfactory. This is worrisome because on theoretical and experimental grounds,[58−63] one can expect that hydrocarbon partitioning is a much better predictor for bilayer permeability than log P: in the bilayer core, no hydrogen bonding with the surrounding lipid is possible and one is better off with a proxy solvent that cannot form hydrogen bonds. Several improvements could be introduced to enhance the accuracy of our model. A step in this direction could be to correct for specific functional groups whose behavior is not well captured using ABC. As illustrated in the Results, amines and sulfides are not significantly improved by the method. However, even for more frequently appearing and well-described groups like nitros, a serious improvement could be obtained. A representative molecule containing a nitro group is o-nitrophenol, which also happens to exhibit an IHB. After our IHB algorithm, we calculate the partition coefficient from the three different conformations, and finally, we estimate the total partitioning. As we can see in Table , the ab initio calculation is far closer to the experimental result than our semiempirical ABC method. This is a serious indication that a more detailed quantum description of the molecule could increase the accuracy of our model.

Table 2

Partition Coefficient of O-Nitrophenol Calculated Using ABC and DFT, Compared to the Experimental Data

	ABC + SplitOH (optimized)	DFT + SplitOH	experiment
Log K	–1.34	0.35	1.48

Conclusions

We have developed a fast and precise method to calculate COSMO charge density and, from that, the partition coefficient. At the core of the model, a bond charge correction is applied to the AM1 semi-empirical method, while special treatment for IHB is introduced. The charge shifts for the bond charge correction were optimized by an iterative process so that the COSMO charge density difference between the new model, ABC, and DFT results was minimized. For molecules that exhibit IHBs, we built an algorithm that generates the majority of the possible combinations of open and closed IHBs to avoid severe underestimation in the partitioning calculation. Subsequently, we re-parameterized some of the COSMO-SAC coefficients, namely, cbase, cOH–OH, cOH–OT, cOT–OT, and the segment surface area, against the experimental data of several thermodynamic and partitioning quantities. Having obtained the optimized COSMO-SAC coefficients, we applied the whole methodology to predict partition coefficients in the BioByte data set and distribution coefficients of drug-like molecules participating in the SAMPL5 challenge. The results were entirely satisfactory and in the case of SAMPL5, comparable to (and perhaps slightly better than) the competition winner[4] at a fraction of the computational cost, as ABC bypasses the expensive DFT computations. The fact that the model is so fast makes it a suitable candidate for large-scale screening studies of lead-like molecules as it provides fast and reliable results. The accuracy could be further improved, considering that ABC results are still inferior to DFT results for specific functional groups. Hence, in future studies, we will focus on finding quantum mechanical descriptors that can correct for the missing information related to the electronic structure of the interacting molecule.

38 in total

1. Direct calculation of 1-octanol-water partition coefficients from adaptive biasing force molecular dynamics simulations.

Authors: Navendu Bhatnagar; Ganesh Kamath; Issac Chelst; Jeffrey J Potoff
Journal: J Chem Phys Date: 2012-07-07 Impact factor: 3.488

2. Small Molecule Solvation Free Energy: Enhanced Conformational Sampling Using Expanded Ensemble Molecular Dynamics Simulation.

Authors: Andrew S Paluch; David L Mobley; Edward J Maginn
Journal: J Chem Theory Comput Date: 2011-07-27 Impact factor: 6.006

3. Calculating Partition Coefficients of Small Molecules in Octanol/Water and Cyclohexane/Water.

Authors: Caitlin C Bannan; Gaetano Calabró; Daisy Y Kyu; David L Mobley
Journal: J Chem Theory Comput Date: 2016-08-01 Impact factor: 6.006

Review 4. Calculation of molecular lipophilicity: State-of-the-art and comparison of log P methods on more than 96,000 compounds.

Authors: Raimund Mannhold; Gennadiy I Poda; Claude Ostermann; Igor V Tetko
Journal: J Pharm Sci Date: 2009-03 Impact factor: 3.534

5. Reduced surface: an efficient way to compute molecular surfaces.

Authors: M F Sanner; A J Olson; J C Spehner
Journal: Biopolymers Date: 1996-03 Impact factor: 2.505

6. Permeability of small nonelectrolytes through lipid bilayer membranes.

Authors: A Walter; J Gutknecht
Journal: J Membr Biol Date: 1986 Impact factor: 1.843

7. Molecular distributions in interphases: statistical mechanical theory combined with molecular dynamics simulation of a model lipid bilayer.

Authors: T X Xiang; B D Anderson
Journal: Biophys J Date: 1994-03 Impact factor: 4.033