Panagiotis C Petris1,2, Paul Becherer2, Johannes G E M Fraaije1,2. 1. Leiden Institute of Chemistry, Leiden University, P.O. Box 9502, 2300 RA Leiden, The Netherlands. 2. Siemens Industry Software Netherlands B.V., Galileiweg 8, 2333 BD Leiden, The Netherlands.
Abstract
We introduce a physics-based model for calculating partition coefficients of solutes between water and alkanes, using a combination of a semi-empirical method for COSMO charge density calculation and statistical sampling of internal hydrogen bonds (IHBs). We validate the model on the experimental partition data (∼3500 molecules) of small organics, drug-like molecules, and statistical assessment of modeling of proteins and ligand drugs. The model combines two novel algorithms: a bond-correction method for improving the calculation of COSMO charge density from AM1 calculations and a sampling method to deal with IHBs. From a comparison of simulated and experimental partition coefficients, we find a root-mean-square deviation of roughly one log 10 unit. From IHB analysis, we know that IHBs can be present in two states: open (in water) and closed (in apolar solvent). The difference can lead to a shift of as much as two log 10 units per IHB; not taking this effect into account can lead to substantial errors. The method takes a few minutes of calculation time on a single core, per molecule. Although this is still much slower than quantitative structure-activity relationship, it is much faster than molecular simulations and can be readily incorporated into any screening method.
We introduce a physics-based model for calculating partition coefficients of solutes between water and alkanes, using a combination of a semi-empirical method for COSMO charge density calculation and statistical sampling of internal hydrogen bonds (IHBs). We validate the model on the experimental partition data (∼3500 molecules) of small organics, drug-like molecules, and statistical assessment of modeling of proteins and ligand drugs. The model combines two novel algorithms: a bond-correction method for improving the calculation of COSMO charge density from AM1 calculations and a sampling method to deal with IHBs. From a comparison of simulated and experimental partition coefficients, we find a root-mean-square deviation of roughly one log 10 unit. From IHB analysis, we know that IHBs can be present in two states: open (in water) and closed (in apolar solvent). The difference can lead to a shift of as much as two log 10 units per IHB; not taking this effect into account can lead to substantial errors. The method takes a few minutes of calculation time on a single core, per molecule. Although this is still much slower than quantitative structure-activity relationship, it is much faster than molecular simulations and can be readily incorporated into any screening method.
In previous publications, we have discussed automated coarse-graining
protocols to calculate partition coefficients[1] and diffusion coefficients.[2] In this
work, we focus on providing a fast and reliable method for calculating
the alkane/water partition coefficient of lead-like molecules with
a molecular weight of up to 800. As far as modeling is concerned,
predicting large-molecule partitioning in a practical industrial setting
is extraordinarily difficult. One must be able to accurately calculate
the COSMO charge density distribution of molecules with a molecular
weight of up to 800 and then the free energy of partition (here by
COSMO-RS[3]) in a reasonable amount of time
(a few core minutes). Finally, internal hydrogen bonding (IHB) redistribution
should be considered, taking proper account of statistical thermodynamics.
In simple terms, when the molecule is dissolved in water, it is energetically
favorable to expose the IHB to the solvent, while in an organic solvent,
the molecule forms IHB to inhibit its interaction with the surrounding
apolar environment.In a recent statistical assessment of modeling
of proteins and
ligand (SAMPL5) competition to predict cyclohexane–water partition
coefficients of lead-like molecules,[4−6] it was found that COSMO-RS
beats the competition, including force-field methods, in both calculation
time and accuracy. From a medicinal chemist point of view, one has
only a few choices. The increasing size of drugs/drug candidates and
the need for fast estimation in medicinal chemistry screening require
significant advancements in ab initio(7) and force-field-based[8,9] methods to compete against
COSMO-RS,[4] UNIFAC,[10] and quantitative structure–activity relationship (QSAR) approaches.[11] QSAR approaches are the ones requiring the least
computational resources, although they are not always precise, and
most importantly, they do not provide physics-based insights to improve
on them. A statistical thermodynamic model like COSMO-RS or UNIFAC
can provide accurate results in relatively short time but cannot really
be used in any system without careful consideration. More specifically,
charged molecules cannot really be handled using COSMO-RS in its current
implementation. Certain functional groups are not described adequately
enough, regardless of the method used to extract COSMO charge densities.
Theoretically, these issues are resolved by using ab initio and force-field-based models. Such approaches will always be the
most accurate ones but very expensive computationally. Depending on
the system, the properties, and the application of interest, one should
decide which method to use cautiously, as this is the most important
choice in building a successful workflow.Despite their shortcomings,
COSMO-RS and UNIFAC that provide accurate
estimations considerably faster than force-field-based methods are
increasingly being used in the literature. In other domains, such
as octanol–water partition coefficients, one could perhaps
rely on massive amounts of data to build an ultra-fast machine learning
correlation or a QSAR model. However, for the systems of current interest,
drug-like molecules partitioning in water and alkane solvents, there
are simply not enough data to calibrate correlations. Hence, a statistical
approach cannot be adopted, and models must necessarily be physics-based.A suitable candidate can be found in the COSMO-RS model because
it requires only a handful of parameters, all with realistic physical
meaning, while it is, at the same time, precise. Unfortunately, in
standard COSMO-RS, one needs to calculate the charge envelope by performing
a costly density functional theory (DFT) calculation, which could
take many days for the molecular targets of our interest. Approximate
methods exist that combine fragments extracted from a library (COSMOfrag[12,13]), but the library and extraction method are not publicly available
nor described in sufficient detail to rebuild these methods.The second significant problem is that many lead-like molecules
exhibit IHBs as these amplify drug-design specificity. From analyzing
small molecules such as salicylic acid and o-nitrophenol,
it is clear that IHBs can have a dramatic effect on the partition
coefficient. The difference between open (in water) and closed (in
hydrocarbon solvent) state can account for roughly one to two log
10 unit per IHB.[14−16] When extended to large molecules (i.e., MW 800) that
can easily have 3–5, if not more, IHBs, the effect is dramatic,
and getting this right will make or break any prediction tool.The present work is structured specifically to resolve the two
issues mentioned above (fast estimation of COSMO charge density and
the IHB effect). In the Systems and Methods section, we describe the methodology that we have developed to predict
COSMO charge density in a fast and accurate way, using the AM1 semi-empirical
method[17−20] alongside a bond charge correction.[21,22] To cope with
the IHB effect, a separate algorithm was developed to generate the
conformations that include all the possible open (no IHB formed) and
closed (IHB formed) states for a given molecule. We also discuss the
way to extract the partition coefficient and the data sets used. In
the Results section, we present the COSMO
charge density as calculated from our newly developed method, AM1-bond
charge correction-COSMO (ABC), and we compare it with the corresponding
DFT outcome. We optimize COSMO-SAC coefficients against the thermodynamic
and partitioning data. We apply the ABC model with the IHB algorithm
and the optimized COSMO-SAC coefficients to the BioByte and SAMPL5
data set. In the Discussion section, we report
a few improvements that could make the model more accurate. The Supporting Information contains the BCC parameters
for ABC, the molecules extracted from the BioByte and SAMPL5 data
set, and the Kuhn patterns used to identify IHB. All algorithms, including
AM1, are implemented in CULGI software version 13.0,[23] installed on a Dell desktop PC equipped with an Intel Xeon
E5-2670 dual processor.
Systems and Methods
This section is separated into four different subsections explaining
the methods we have developed and the systems studied. We present
the “AM1-bond-charge-correction-COSMO” (AM1-BCC-COSMO,
ABC for short) algorithm we have developed for calculating the COSMO
charge density of a molecule. We include a detailed explanation of
the IHB algorithm we developed to generate all possible IHB conformers.
Subsequently, we illustrate the methodology for parameterizing COSMO-SAC
coefficients and the features and characteristics of the data sets
we applied the protocol to (CULGI,[1] NIST
standard reference database 103b as implemented in the NIST ThermoData
Engine (TDE) version 7.0, BioByte,[24] and
SAMPL5[4,25,26]). A schematic
representation of the implemented protocol can be found in the form
of a flowchart in Figure .
Figure 1
Protocol’s flowchart is used to optimize the ABC model,
generate IHB conformers, parameterize the COSMO-SAC coefficients,
and predict the SAMPL5 distribution coefficients (denoted by a dashed
rectangle).
Protocol’s flowchart is used to optimize the ABC model,
generate IHB conformers, parameterize the COSMO-SAC coefficients,
and predict the SAMPL5 distribution coefficients (denoted by a dashed
rectangle).
ABC Algorithm
We introduce a COSMO
charge density calculation method based on Dewar’s AM1,[17−20] a widely used and fast semi-empirical quantum method. Our implementation
of COSMO in AM1 is largely based on the method described by Klamt
and Schüürmann,[24] specifically
solving the linear system from eq 9 in that paper directly. However,
for calculating the interaction of the electron density (orbitals)
with the COSMO surface charges, we follow the implementation by Klamt
in MOPAC 7.[27−29] There, the electron density is approximated by a
collection of atom-centered multipoles (monopoles, dipoles, linear
quadrupoles, and square quadrupoles) in the same way as this is used
for calculating the two-center repulsion integrals in MNDO,[30,31] using the same parameters as MNDO. An additional approximation is
then made by treating the multipoles as point dipoles instead of physical
dipoles (taking the limit to zero distance between charges while keeping
the multipole moments constant). This leads to relatively simple analytical
expressions for the electrostatic potential arising from the electron
density. The explicit expressions can be found in the Supporting Information. Additionally, our implementation
of COSMO in AM1 generates smoother COSMO surfaces than MOPAC. The
construction of the solvent-excluded surface is based substantially
on the work of Sanner et al.[32]AM1
by itself does not give accurate electrostatic moments and as a result
gives unsatisfying COSMO charge density,[33] which we propose to repair by a QSAR type of approach. Our new ABC
algorithm is a bond charge correction that works similar to the BCC
method for force fields.[21,22] Compared to the force-field
BCC algorithm, we do not optimize for atomic partial charges but for
COSMO surface charge density. The BCC parameters are calculated by
comparing DFT to AM1 results; no experimental input is needed, but
instead, we can profit from matching vast amounts of purely computer-generated
data. The algorithm itself can be described as follows:Perform an AM1 calculation with COSMO
as usual, giving
the COSMO surface charge density σMAM1 of the molecule.Apply the AM1-BCC method (with modified parameters to
be determined) to obtain atom-centered correction charges qMBCCCalculate the COSMO surface charge density that results
from these correction charges alone, ΔσMBCC by solving separately the COSMO
equations using only the correction charges.Because the COSMO equations are linear, this COSMO correction
charge
density is added to the charge density from the original AM1-COSMO
calculation to obtain an improved COSMO surface charge density σMAM1-BCC =
σMAM1 +
ΔσMBCC. In this way, the improvement of the COSMO calculation is a straightforward
post-processing step that can be performed after a regular AM1-COSMO
calculation. The improved COSMO sigma profile can then be used in
COSMO-RS-type calculations, after re-optimizing the COSMO-RS parameters
for this source of sigma profiles.The modified AM1-BCC parameters
for this procedure are found by
fitting the parameters to minimize the difference between the AM1-BCC-COSMO
surface charge density σMAM1-BCC = σMAM1 + ΔσMBCC and the σMDFT resulting from
a full DFT COSMO calculation. This is in contrast with the force-field
BCC method, where the parameters are fitted to reproduce atomic partial
charges. For the optimization, we must construct an objective function
to be minimized, which expresses the deviation of the AM1-BCC-COSMO
results from the full DFT calculations. The surface charge density
is calculated on a discretized (triangulated) surface, leading to
a representation as a collection of local surface charge densities
σM,I,, one for each surface segment i in the molecule M. In the objective function, we
multiply each σ with the surface area a of the segment to properly weigh the size of the
surface segments. We, then, minimize numerically the following objective
functionMinimizing this function (the total square deviation) is equivalent
to minimizing the root-mean-square deviation. In order to calculate
the difference between the surface charge density distributions segment
by segment, we perform a COSMO calculation on the DFT BP86/def2-TZVP[34−39]-optimized geometry for both the full DFT calculation and the AM1-BCC-COSMO
calculation, using the same COSMO surface tessellation for both cases.The set of molecules that we used for this parameterization is
a subset of an in-house database of over 11,000 molecules (“the
Culgi molecular database”) that all have pre-computed COSMO
information, calculated with DFT as described above. To obtain a manageable
subset of this database while making sure that we have enough occurrences
of each BCC bond type, we followed this procedure:For each bond type in the list of
BCC bond types, we
determined the molecules that contain this type.If there were fewer than 10 molecules containing a bond
type, we did not include that bond type in our list. This means, for
example, that we cannot apply our AM1-BCC-COSMO on iodine-containing
molecules because we only have a handful of these in our entire database,
not enough for a meaningful parameterization of the iodine-containing
bond types. The more than 300 bond types that were present in the
original AM1-BCC scheme were thus reduced to 149 types. Of these 149
types, there are 17 that are bonds between atoms of identical type,
where the correction charge is necessarily 0. That leaves 132 non-trivial
bond types. These parameters suffice for the vast majority of molecules
in the Culgi Molecules Database.For
each bond type that has 10 or more molecules containing
it, 10 molecules with the lowest molecular weight were selected.The overlap between these sets of 10 molecules
left
us with 965 molecules to parameterize 132 non-trivial bond charge-correction
parameters.The objective function was
minimized using an L-BFGS-B minimizer
as implemented in the SciPy library for the Python programming language.The resulting set of bond charge corrections can be found in the Supporting Information. A more quantitative analysis
of the results of applying this to the entire Culgi Molecules Database
is presented in the Results section. However,
an indicative demonstration of how well the new model works is provided
in Figure . We show
the COSMO charge density distribution of propanol calculated from
AM1, ABC, and DFT, respectively. It is evident that ABC (red line)
is far closer to DFT results (blue line) than plain AM1 (black line),
even for a molecule without special and exotic functional groups,
such as propanol.
Figure 2
COSMO charge density distribution of propanol calculated
using
(a) AM1 semiempirical method, (b) ABC model, and (c) DFT with the
modified NWChem software package.[40]
COSMO charge density distribution of propanol calculated
using
(a) AM1 semiempirical method, (b) ABC model, and (c) DFT with the
modified NWChem software package.[40]DFT calculations were carried out with a modified
version of NWChem
6.3,[40] with an improved COSMO mesh generation
based substantially on the work of Sanner et al.,[32] and with outlying charge correction.[41] Molecules were geometry-optimized in vacuum, using the
def2-TZVP basis set[34,37] and BP86 exchange–correlation
functional.[36−38] A COSMO calculation was then performed on that geometry
using the def2-TZVPD basis set.[34,35,39] These are the recommended high-quality settings for COSMO-RS as
used in COSMOtherm software[42] and also
used in the entry of Klamt et al. in the SAMPL5 challenge.
COSMO-RS Partition Coefficients
The
ABC algorithm is completed by reparameterization of some of the COSMO-SAC
coefficients by fitting four sets of thermodynamic and partitioning
data. Our implementation is based on COSMO-SAC,[43,44] but because COSMO-SAC and COSMO-RS differ only cosmetically, we
will use the name COSMO-RS throughout the paper. With the ABC method,
we can calculate the COSMO charge density of even a large molecule
(MW 800, the upper range of SAMPL5 molecules) in just a few minutes
on a single core. Afterward, one can easily derive the partition coefficient
in less than a second.The partition coefficient is defined
as the ratio of concentrations in the two solvents as followswhere Vwater and Voil are the molar volume of water
and oil, respectively,
while the difference in chemical potentials Δμ0 is calculated from COSMO-RS through the activity coefficients γ,
viawhere a is the surface area and σ is
the COSMO surface charge density on segment i. When
two or more conformations of the same molecule are included
in the partition coefficient calculation, the corresponding relation
is transformed. For two conformations A and B, in oil/water solvents,
the cumulative partition coefficient is provided by the following
equationwhere Eoil(A)
and Ewater(A) are the total energy required
for the conformation A to move from the perfect conductor state into
a solvent like oil or water, respectively (same for conformation B).
Such a quantity for a conformer i in a solvent S can be calculated aswhere ECOSMO(i) is the
total quantum mechanical energy of the conformer i in the perfect conductor and μS(i) is the chemical potential of the conformer i in
solvent S that accounts for the transition of this conformer
from a perfect conductor into the solvent S.COSMO-RS theory
amounts to the integration over the surface, of
a local free energy difference for a charged segment exposed to two
different solvents Δf = foil – fwater. The local
free energy function differs between various COSMO-RS flavors; in
most cases, as we do here, one accommodates the polarization energy
of a charged segment in a conductor and a hydrogen bonding term. In
the limited sense of integration over the surface, the model is not
too dissimilar from QSAR, where one integrates over different types
of polar groups and hydrogen donors and acceptors. The essential difference
is that in COSMO-RS, one does not employ a group description and group
parameters but a generic interaction model based on electrostatics
from quantum chemistry calculations, and therefore, one needs only
a few physics-based parameters, instead of a whole zoo of groups and
group parameters.In this paper, we use the “SplitOH”
COSMO-SAC model
from Lin’s group[45,46] (a recent approach
by Chang et al.[47] suggests directional
interactions for a more accurate representation of hydrogen bonding).
The Staverman–Guggenheim combinatorial term is used. Instead
of taking the published Lin parameters as is, we re-optimized the
COSMO-SAC coefficients. This was carried out by minimizing the total
root-mean-square error for four sets of thermodynamic and partitioning
data:400 pressure data points
for binary vapor–liquid
equilibria (VLEs)400 excess Gibbs free
energies for 50/50 liquid binary
mixtures (derived from VLE data)400
partition coefficients for octanol/water212 partition coefficients for hexadecane/waterVLE data were obtained from the NIST standard reference database
103b as implemented in the NIST TDE version 7.0. Partition coefficients
were obtained from the BioByte data set.[24] More information related to the parameterization scheme can be found
in the Supporting Information. The optimized
settings can be seen in the following table where cbase defines the interaction between any related segment, cOH–OH provides the (additional) interaction
between segments with hydroxyl groups, cOH–OT gives the (additional) interaction between a segment containing
a hydroxyl group and a segment with a different hydrogen bonding acceptor,
and cOT–OT represents the interaction
between two segments that do not contain any hydroxyl group. The segment
surface area is the size of a thermodynamically independent segment.
This is equivalent to the inverse of the so-called “coordination
number” in chemical engineering models.One can use the
suggested coefficients in Table alongside a software package supporting
COSMO-SAC calculations. An open source implementation of COSMO-SAC
has recently become available.[48]
Table 1
Optimized COSMO-SAC Coefficients
cbase (kcal/mol)·(Å/e2)
cOH–OH (kcal/mol)·(Å/e2)
cOH–OT (kcal/mol)·(Å/e2)
cOT–OT (kcal/mol)·(Å/e2)
segment surface area
(Å2)
6032.98
3175.29
2209.34
940.67
7.58
IHB Sampling
The special treatment
of IHB in a molecule is related to the different conformations that
the molecule can form when dissolved in a polar or apolar solvent.
More specifically, if a molecule that exhibits an IHB is dissolved
in a polar solvent, the acceptor (hydroxyl, amine, etc.) and the donor
would be exposed to the solvent, the so-called “open”
conformation. If dissolved in an apolar solvent (i.e., alkane), the
molecule tends to form IHB so that the acceptor and donor are hidden
from the solvent; this is the “closed” conformation.
Therefore, we have conducted conformation analysis for molecules exhibiting
IHB, so that all possible combinations of open and closed IHB structures
will be included in the final analysis. Such analysis ensures that
the effect of the solvent on the molecular conformation is explored
and accounted for.Following the work of Kuhn et al.[49] we have used the Kuhn rings to identify the
presence of IHB in a molecule. The patterns do not include the trans-core
IHBs that can be found in macrocycles, and we therefore omitted all
such molecules from the present study. We have included an additional
pattern for the IHBs between carboxylic groups and aromatic nitrogen.
Finally, a few additional rings are included to make sure that we
capture the IHBs present in smaller molecules like salicylic acid
and o-nitrophenol.The second challenge in
IHB sampling is to find an automatic way
of reproducing the conformations that would include all the combinations
of open and closed IHBs. We start by assigning zero charges to all
atoms, apart from the donor and acceptor atoms that take part in the
formation of the IHB. To the donor/acceptor couple, we assign opposite
charges if the closed conformation should be formed (attractive electrostatic
interaction) or equal charges if the open conformation should be formed
(repulsive electrostatic interaction). Using these charges, we run
successive molecular dynamics runs followed by energy minimization,
until a hydrogen bond distance criterion is met (distance below a
certain value if an IHB is to be formed and distance above a certain
value if the atoms should not form an IHB). Each molecular dynamics
run consists of 1000 steps. We have selected an NVT ensemble with the Andersen thermostat, at a temperature of 298 K,
with a 0.5 fs timestep coupled with the Dreiding force field.As the second step, we re-assign the correct charges and geometrically
optimize the conformation, so that we make sure that the structure
is energetically favorable. This way, after the optimization step,
some of the conformations might end up without the desirable IHB,
although the vast majority of them will maintain the desired structure.
Then, we apply AM1 geometry optimization to each conformation to ensure
that we have a physically meaningful representation of the molecule.
The COSMO charge density is then calculated using ABC for every conformation.
Finally, we calculate the partitioning of the molecule between water
and alkane using eq .A characteristic example can be found in salicylic acid. Figure illustrates the
two different conformations that should be taken into account when
calculating the partition coefficient of this molecule in water/alkane
solvents. When the molecule is dissolved in water, the “open”
conformation, depicted in Figure a, is practically the only one observed. In contrast,
when dissolved in alkane solution, the “closed” conformation,
represented in Figure b where an IHB is formed, is predominant. The COSMO surface is visualized
in the form of a color map. The red surface areas correspond to a
positive COSMO charge density, while the blue areas carry a negative
COSMO charge density. The gray COSMO surface indicates neutral COSMO
charge density.
Figure 3
Salicylic acid’s (a) open conformation, favorable
in an
aqueous environment, and (b) closed conformation, favorable in alkane
solution. The COSMO surface is depicted in the form of a color map.
Salicylic acid’s (a) open conformation, favorable
in an
aqueous environment, and (b) closed conformation, favorable in alkane
solution. The COSMO surface is depicted in the form of a color map.The severe effect of leaving out the closed conformation
is evident
when estimating the salicylic acid partition coefficient in water/cyclohexane.
Using eq , the open
conformation partition coefficient is equal to −5.43, while
the experimental one is −1.97. If we use eq to include both conformations, we get to
the value of −2.54, which is in good agreement with the experiment.
Data sets
We used the BioByte database[24] as the primary source for partition coefficients
between water and alkane solvents (hexane, cyclohexane, heptane, isooctane,
octane, decane, undecane, dodecane, and hexadecane). We have curated
the data set manually: for the solutes with data in more than one
solvent, we only accepted those entries where log 10 deviated less
than 1.5 unit between the lowest and highest. We accepted all entries
for which only one solvent is present, for lack of a good curation
criterion. We have also omittedall tautomer-forming molecules (as they need to be separately
investigated for different solvents, while we have not developed a
special model for their treatment),all
molecules with more than five rotatable bonds (because
we focus mostly on the residual contribution, so we limit surface
flexibility and folding effects), andall crown ethers, where water binding should be taken
into account due to the extreme hydrogen bonding behavior.The OpenBabel package is utilized to generate
the molecular
three-dimensional conformations from their corresponding SMILES strings.
Subsequently, we stretch the molecule open to prevent energetically
unfavorable starting conformations; this also avoids unphysical steric
and intramolecular effects. To do so, we apply molecular dynamics
without charges to open the molecule, followed by energy minimization
with the corresponding force-field charges to get a molecule with
realistic geometry. When the same molecule is encountered again for
different solvents, we keep the generated conformation from the earlier
occurrences to ensure that statistical mechanics from the molecular
dynamics steps will not sample different conformations for different
alkane solutes, hence providing different results.As discussed
in the previous section, a further distinction is
made between the molecules with IHB (a few hundred) and those without
IHB (a few thousand). The molecules are listed in the Supporting Information. A second set is the SAMPL5
data set of drugs and drug-like candidates.[4,25,26] In Figure , we present the molecular weight distribution for
the BioByte and SAMPL5 data set. The average of the molecular weight
distribution for the BioByte data set is equal to 180 and for SAMPL5
is equal to 290. This difference is expected because BioByte contains
many smaller solutes, while SAMPL5 includes larger lead-like molecules.
Figure 4
Molecular
weight distribution for BioByte and SAMPL5 data sets.
Molecular
weight distribution for BioByte and SAMPL5 data sets.
Results
In this section, we demonstrate
how we acquired the correction
charges for the ABC model for some of the most representative functional
groups. Subsequently, we have applied the protocol (ABC alongside
IHB sampling) to predict the partition and distribution coefficients
in the BioByte and SAMPL5 data sets, respectively.
ABC Optimization
As explained in
the Systems and Methods section, we optimized
the ABC model against DFT results for a wide range of functional groups
found in the CULGI database (extensively analyzed in previous publications[1]), such that the surface charge density disparity
between ABC and DFT is minimized. The quadratic error in the surface
charge density between AM1 and DFT and between ABC and DFT (for a
few of the most frequently appearing functional groups) is represented
using boxplots in Figure .
Figure 5
Quadratic surface charge density error between DFT and (a) AM1
and (b) ABC for a selection of different functional groups represented
in a boxplot fashion.
Quadratic surface charge density error between DFT and (a) AM1
and (b) ABC for a selection of different functional groups represented
in a boxplot fashion.By comparing either the
median value or the relative position of
a boxplot between AM1 and DFT (Figure a) and between ABC and DFT (Figure b), one can see that for all functional groups,
apart from sulfides, there is a moderate to substantial improvement
using ABC instead of AM1. We can see the molecules with hypervalent
sulfur and phosphorus in the upper regions of Figure a: “sulfonamide” and “phosphate”—the
latter including thiophosphates. In Figure b, these groups improve significantly, with
errors ending up in the same range as more “regular”
molecules. It is known that AM1 does not deal well with these atom
types, and this is also clear from looking at the COSMO surface of
these molecules when comparing the DFT COSMO surface charge density
with the AM1COSMO surface charge density, as we show graphically
in the Supporting Information for the pyridine-3-sulfonamide
molecule. The nitriles show a moderately large improvement. In general,
for large molecules, the dominant source of error is often formed
by large alkyl tails.The values of the optimized parameters
also bear this out. For
the bonds involving hypervalent S and P, the charge shift is often
as large as 0.4 to even 0.9 electron charges. For nitriles and similar
groups, this may still be up to 0.1 to 0.2, but in the vast majority
of other bonds, the charge displacement is below that value. To quantify
our metric in familiar COSMO terms, if there is a difference in σerror of one unit between the estimation of AM1 and ABC, this
corresponds to approximately 0.85 kBT difference in the
total surface segment energy. Because the introduced bond charge correction
significantly improves AM1 results, we are justified to apply it in
calculating the COSMO charge density and, subsequently, the partition
coefficient.
Protocol Application to
the BioByte Data Set
We have applied the ABC model to the
BioByte data set to predict
the alkane/water partition coefficient. For molecules that exhibit
IHBs, we applied the IHB sampling algorithm as described in the Systems and Methods section.For molecules
containing carboxyl groups, we use only the syn conformation (with
the hydrogen pointing in roughly the same direction as the carbonyl
oxygen) and not the anti-conformation (with the hydrogen pointing
away) because the former is the most stable and energetically favorable
one as far too polar behavior is observed when we sample energetically
unfavorable carboxyl anti-conformations (donor and acceptor exposed
to the solvent). The necessity of special treatment for carboxyl groups
has already been demonstrated in force-field-based works in the literature.[50,51]Figure compares
the experimental Kexp and calculated Kcalc partition coefficients for each hydrocarbon
solvent (namely, isooctane, decane, hexane, heptane, octane, undecane,
dodecane, hexadecane, and cyclohexane). A color density map is used
for regions with overlapping data points. Prior to optimization, using
the SplitOH coefficients as suggested by Lin’s group,[45,46] we obtained RMSD = 1.92 with a correlation coefficient of 0.80.
Using the optimized COSMO-SAC coefficients (shown in Table ), we calculated RMSD = 1.24
with a correlation coefficient of 0.81. This is a significant improvement,
considering how diverse the data set is.
Figure 6
Computed partition coefficient,
from ABC and IHB sampling, plotted
against the experimental partition coefficient of the BioByte data
set for different solvents. The red line corresponds to y = x. A color density map is used for regions with
overlapping data points.
Computed partition coefficient,
from ABC and IHB sampling, plotted
against the experimental partition coefficient of the BioByte data
set for different solvents. The red line corresponds to y = x. A color density map is used for regions with
overlapping data points.Looking in more detail,
taking each solvent separately into account,
we notice that the model appears better for some solvents and not
that adequate for others. For instance, heptane exhibits RMSD = 1.55,
while hexadecane shows RMSD = 0.79. Heptane has more data points,
but it is also the noisiest of the two. This noise does not appear
to be a feature of the solvent; we have verified this for the same
molecule, partitioning is almost identical for both solvents. This
is to be expected because for all alkanes, the COSMO profile is essentially
a delta-function, that is, all these hydrocarbon molecules have an
almost zero COSMO charge density everywhere on the entire molecular
surface.However, we have identified several molecules of rather
exotic
structures and functional groups being experimentally studied only
in the case of heptane (like morin and xanthine). ABC is less accurate
in predicting the partitioning of such molecules, which are more common
in the heptane data set, making this particular solvent appear to
be much more problematic and prone to error. This is an indication
of how important it would be to improve upon functional groups that
are not described with high accuracy in the current model.The
scatter plots of hexadecane and heptane show a downward curvature
for larger values of the partition coefficient, which could be attributed
to a combination of conformational changes and insufficient alkyl–water
parameterization. In the current ABC implementation, all conformations
are obtained by AM1 minimization; no sampling is applied (apart from
cases where IHBs occur and we apply a special sampling as explained
in the IHB sampling section). We have checked that almost all compounds
with large partition coefficients are hydrocarbons. Overall, the fit
is entirely satisfactory. The attained accuracy is typical for what
one can expect for oil–water partition coefficients from COSMO-RS.
Still, here we have achieved this result by applying semi-empirical
calculations (a few minutes per molecule) rather than full and much
more expensive quantum DFT calculations (hours per molecule).[4]
Protocol Application to
the SAMPL5 Data Set
We can, also, apply the developed protocol
in predicting the distribution
coefficient for water/cyclohexane of the SAMPL5 data set containing
drug-like molecules. The distribution coefficient is given by the
following relationwhere K is the partition
coefficient of the neutral molecule and α is the neutral fraction
of molecules at pH = 7.4; pK values are estimated
with the program cxcalc from ChemAxon.As we observe in Figure our model gives
RMSD = 1.88, slightly outperforming Klamt’s result of RMSD
= 2.07, while our model correlation coefficient is equal to 0.81,
comparable to Klamt’s 0.85. The overall comparison indicates
that we can obtain similar results, if not improved, using ABC to
calculate COSMO charge density and IHB sampling in the case of IHB.
At this point, we need to stress that on comparing the RMSD with respect
to the experimental data and uncertainty in measurements, the difference
between our model and Klamt’s COSMO-RS is insignificant. Therefore,
in the SAMPL5 challenge, our protocol would have performed similar
to the winner.
Figure 7
Distribution coefficient calculated from ABC and IHB sampling,
alongside the results from Klamt et al.[4] against the experimental data of the SAMPL5 challenge. The red line
corresponds to y = x.
Distribution coefficient calculated from ABC and IHB sampling,
alongside the results from Klamt et al.[4] against the experimental data of the SAMPL5 challenge. The red line
corresponds to y = x.What deserves further investigation is that both ABC and
Klamt’s
model deviate from the experimental values in two ways: for polar
molecules, partition coefficients are too low (log too negative),
and for less polar molecules, the partition coefficients are systematically
too high (log too positive). In fact, it seems as if the entire curve
is shifted a few units in the positive direction. These two effects
have been discussed in extenso by Klamt[4] and the SAMPL5 organizers,[5] without any
clear conclusion. Other prediction methods of very different backgrounds
(force fields) seem to suffer from this same phenomenon, so it could
be concluded that this is partly an experimental effect. It could
also be due to limited conformational sampling. Apart from molecules
exhibiting IHBs, we did not use any conformational sampling. We simply
took the conformation of the lowest AM1 energy in a vacuum. In contrast,
force-field methods do sample conformations while Klamt also used
conformational sampling. Considering that ABC performs similar to
or better than all of them at a fraction of the computation time,
the sampling issue remains a puzzle.
Discussion
To our knowledge, there is no similar large-scale comparison between
the calculated and experimental partition coefficients. As far as
we know, in works using molecular dynamics, only a handful or at the
most a few tens (SAMPL5) of molecules are investigated.[52,53] These data sets are too small to draw statistically valid conclusions,
and anyway, a molecular dynamics simulation on the scale of the BioByte
database is not possible in a reasonable amount of time. On the QSAR
level, we also found only a few studies: a study on ∼100 molecules
using Abraham descriptors (a fine RMSD 0.2–0.3 but a rather
poor correlation r2 0.5–0.7)[54] and an older QSAR study on ∼100 molecules
by the group of Leo (the source of some of the experimental data we
use here).[55] These studies have been limited
to a few tens or a few hundred molecules at most, and apart from this,
it is always challenging to extract physical insights from QSAR modeling.For the ubiquitous log P (octanol/water) predictions
(COSMO-RS was recently used to predict octanol/water partitioning
in small drug molecules[56]), the situation
is very different and large-scale comparisons abound (see, e.g., the
review[57] that compares methods based on
a data set of more than 96,000 molecules). For hydrocarbon predictions,
the situation is much less satisfactory. This is worrisome because
on theoretical and experimental grounds,[58−63] one can expect that hydrocarbon partitioning is a much better predictor
for bilayer permeability than log P: in the bilayer
core, no hydrogen bonding with the surrounding lipid is possible and
one is better off with a proxy solvent that cannot form hydrogen bonds.Several improvements could be introduced to enhance the accuracy
of our model. A step in this direction could be to correct for specific
functional groups whose behavior is not well captured using ABC. As
illustrated in the Results, amines and sulfides
are not significantly improved by the method. However, even for more
frequently appearing and well-described groups like nitros, a serious
improvement could be obtained. A representative molecule containing
a nitro group is o-nitrophenol, which also happens
to exhibit an IHB. After our IHB algorithm, we calculate the partition
coefficient from the three different conformations, and finally, we
estimate the total partitioning. As we can see in Table , the ab initio calculation is far closer to the experimental result than our semiempirical
ABC method. This is a serious indication that a more detailed quantum
description of the molecule could increase the accuracy of our model.
Table 2
Partition Coefficient of O-Nitrophenol
Calculated Using ABC and DFT, Compared to the Experimental
Data
ABC + SplitOH (optimized)
DFT + SplitOH
experiment
Log K
–1.34
0.35
1.48
Conclusions
We have developed a fast and precise method to calculate COSMO
charge density and, from that, the partition coefficient. At the core
of the model, a bond charge correction is applied to the AM1 semi-empirical
method, while special treatment for IHB is introduced.The charge
shifts for the bond charge correction were optimized
by an iterative process so that the COSMO charge density difference
between the new model, ABC, and DFT results was minimized. For molecules
that exhibit IHBs, we built an algorithm that generates the majority
of the possible combinations of open and closed IHBs to avoid severe
underestimation in the partitioning calculation.Subsequently,
we re-parameterized some of the COSMO-SAC coefficients,
namely, cbase, cOH–OH, cOH–OT, cOT–OT, and the segment surface area,
against the experimental data of several thermodynamic and partitioning
quantities. Having obtained the optimized COSMO-SAC coefficients,
we applied the whole methodology to predict partition coefficients
in the BioByte data set and distribution coefficients of drug-like
molecules participating in the SAMPL5 challenge. The results were
entirely satisfactory and in the case of SAMPL5, comparable to (and
perhaps slightly better than) the competition winner[4] at a fraction of the computational cost, as ABC bypasses
the expensive DFT computations.The fact that the model is so
fast makes it a suitable candidate
for large-scale screening studies of lead-like molecules as it provides
fast and reliable results. The accuracy could be further improved,
considering that ABC results are still inferior to DFT results for
specific functional groups. Hence, in future studies, we will focus
on finding quantum mechanical descriptors that can correct for the
missing information related to the electronic structure of the interacting
molecule.
Authors: Michael R Shirts; Christoph Klein; Jason M Swails; Jian Yin; Michael K Gilson; David L Mobley; David A Case; Ellen D Zhong Journal: J Comput Aided Mol Des Date: 2016-10-27 Impact factor: 3.686
Authors: Ariën S Rustenburg; Justin Dancer; Baiwei Lin; Jianwen A Feng; Daniel F Ortwine; David L Mobley; John D Chodera Journal: J Comput Aided Mol Des Date: 2016-10-07 Impact factor: 3.686