A new molecular descriptor, nConf20, based on chemical connectivity, is presented which captures the accessible conformational space of a molecule. Currently the best available two-dimensional descriptors for quantifying the flexibility of a particular molecule are the rotatable bond count (RBC) and the Kier flexibility index. We present a descriptor which captures this information by sampling the conformational space of a molecule using the RDKit conformer generator. Flexibility has previously been identified as a key feature in determining whether a molecule is likely to crystallize or not. For this application, nConf20 significantly outperforms previously reported single-variable classifiers and also assists rule-based analysis of black-box machine learning classification algorithms.
A new molecular descriptor, nConf20, based on chemical connectivity, is presented which captures the accessible conformational space of a molecule. Currently the best available two-dimensional descriptors for quantifying the flexibility of a particular molecule are the rotatable bond count (RBC) and the Kier flexibility index. We present a descriptor which captures this information by sampling the conformational space of a molecule using the RDKit conformer generator. Flexibility has previously been identified as a key feature in determining whether a molecule is likely to crystallize or not. For this application, nConf20 significantly outperforms previously reported single-variable classifiers and also assists rule-based analysis of black-box machine learning classification algorithms.
One of the key steps
in approaching a cheminformatics problem is
the definition of the “chemical space” used to describe
the problem.[1] The set of numerical descriptors
chosen to capture the characteristics of molecules defines the basis
vectors of this space, with n linearly independent
descriptors giving rise to an n-dimensional space,
in which the coordinates of a particular molecule are given by the
values of the descriptors for that molecule. Careful selection of
these descriptors provides a useful chemical space for data visualization,
similarity measures, and classification or clustering algorithms.Descriptors can be broadly categorized according to their “dimensionality”,
based on the type of molecular representation used to calculate them.[2] Zero-dimensional descriptors can be calculated
directly from the molecular formula, e.g. molecular weight, while
one-dimensional descriptors are bulk properties of the molecule, e.g.
calculated solubility.[3] Two-dimensional
descriptors, such as connectivity indices and properties of the molecular
bond graph, are calculated from a traditional two-dimensional representation
of the molecule. Three-dimensional descriptors are computed from a known conformation of a molecule
and capture features of a molecule such as shape, distribution of
charges, and radius of gyration.[4]It has previously been shown that increased flexibility can reduce
the crystallization tendency of a molecule[5] and that rotatable bond count (RBC), a 2D descriptor based on a
set of SMARTS pattern matching rules, was an important feature of
molecules for the determination of how easily a molecular material
can be crystallized.[6] This may be due to
the reduced effective concentration of the “correct”
crystallizing conformer in solution for a molecule with more rotatable
bonds.[5] However, the exact mechanisms of
nucleation and growth of crystals and the influence of molecular conformation
are still not fully understood.[7,8]RBC is quite a
crude approximation of molecular flexibility; Kier
devised a way of encoding this attribute based on the chemical graph,
but this uses descriptors which are also based solely on two-dimensional
information.[9]Other studies of conformational
flexibility have been computationally
expensive because they attempt to evaluate the entire potential energy
hypersurface,[10,11] which is impractical for a large
number of molecules. They also either do not yield a single value
which can be correlated with a physical property of the molecule or
are only appropriate for use with a specific subset of chemical space,
such as alkanes.[12,13]A more direct measure could
be obtained by designing a descriptor
based on a sampling of the energetically accessible conformers of
a given molecule. This approach is relatively quick to compute as
it only involves finding the minima of the potential energy surface,
and it falls into the 2D category of descriptors outlined above since
its value depends only on the chemical connectivity of a molecule
but will capture 3D information about the number of low energy conformers
of that molecule. This eliminates the computation of the barriers
to interconversion and assumes that all conformers are energetically
accessible on the time scales of crystallization. Implementation of
this new “3D from 2D” descriptor and its application
to the problem of predicting if a molecule will be observed to crystallize
is described herein.
Methods
Conformer Generation
Molecules were provided to the
conformer-generation step as SMILES strings to ensure no residual
conformational information was retained, and explicit hydrogen atoms
were added to the skeleton. RDKit cheminformatics toolkit[14] functions were used to generate 50 random molecular
conformations, while retaining the starting stereochemistry. RDKit
was chosen over other open-source conformer generation tools like
BALLOON, CONFAB, and FROG2 and commercial platforms such as MOE, due
to speed and the ability to generate conformers which are structurally
similar to experimentally determined structures.[15] A knowledge-based conformer generator which uses experimental
observations of torsional angle distributions is available in the
latest release of the Cambridge Structural Database (CSD) tools.[16] These alternatives have not been explored in
this work but could potentially be used to sample conformational space
in a similar manner to RDKit.Each randomly generated conformer
was optimized using the Merck Molecular Force Field (MMFF94).[17] MMFF94 is a general purpose parametrized force
field comprised of several well-defined contributions to the total
potential energy of a molecule, including bond stretching energy,
bond torsion, and electrostatic and van der Waals interaction energies.
The force field parametrization is determined by training on a large
set of computational data derived from ab initio calculations
on a diverse range of organic and bioorganic structures and has been
implemented within the RDKit.[18] Some other
force fields suitable for organic molecules include Amber, Gaff, and
CHARMM.[19] The Universal Force Field (UFF)
can be used to compute energies and gradients of molecules containing
almost any element and may therefore prove useful if extending this
work to metal–organic complexes or inorganic molecular materials.
MMFF94 has been shown to reproduce gas-phase conformer energies more
accurately than these other widely available force fields[20] and was chosen for its significantly shorter
computational time compared to a more accurate molecular dynamics
calculation including solvent effects.If the optimization did
not converge to a stable minimum the conformer
was removed. The force field is then used to calculate the energy
of each conformer; its energy relative to the lowest-energy conformer
found is stored. The lowest energy conformer is retained, and for
each other conformer the alignment of all permutations of matching
atom orders with the other conformers is checked, to account for symmetry.
Any duplicate conformers with a heavy atom root-mean-square (RMS)
distance of less than 1.0 Å to any other conformer are removed.For the small minority (0.05%) of molecules where the MMFF optimization
failed, the molecule was removed from the study.The entire
calculation of the energies takes around 0.2 s for molecules
with fewer than two rotatable bonds, 1–2 s for molecules with
4 or 5 rotatable bonds, and up to 5 s for molecules with 8 rotatable
bonds.
Predictive Model
CSD molecules were obtained from crystal
structures in the 2016 CSD release (version 5.37), while lists of
commercially available molecules were obtained from ZINC15 downloaded
in May 2016. 177 molecular descriptors were calculated using the RDKit
cheminformatics toolkit,[14] version Q1 2016.
Machine learning algorithms and performance metrics were implemented
using version 17.0 of the scikit-learn package.[21] The descriptor definitions and an example of the method
used to train a model and output a predictive accuracy from a set
of training and test molecules with known labels are given in the
Supporting Information of Wicker and Cooper (see http://www.rsc.org/suppdata/ce/c4/c4ce01912a/c4ce01912a1.pdf).[6]Training and test molecules
were selected as in a previous study[6] using
information extracted from the ZINC[22] database
and the CSD.[16] In this instance, no drug-like
filter was applied, to include all organic molecules, which resulted
in a set of 48112 commercially available molecules of which 36083
were used for training and 12029 were reserved for a test set. Half
of each set consisted of “observed to crystallize” molecules
(found in both ZINC and the CSD) and the other half consisted of “not
observed to crystallize” molecules (found only in ZINC).Support vector machines (SVMs) were used as the machine learning
algorithm to create the predictive model using the molecular descriptors,
having previously been found to give the best performance for this
classification problem.[6]
Rule Extraction
The “black-box” nature
of the nonlinear SVM predictive model prevents direct determination
of the most important descriptors used in performing the classification.[23] Two methods were used to identify these descriptors.
Single
Variable Classifiers
The descriptors which are
calculated by RDKit were each used in turn to create a single variable
classifier built in order to find the descriptor which gave the best
predictive accuracy and therefore the most effective classification.[24] The accuracy was assessed both by 5-fold cross-validation
on the training set and by prediction on an external validation set.
This approach can be extended to two (or more) variable classifiers.
Decision Tree
Rule extraction techniques can be used
to mimic the SVM model as closely as possible in order to infer how
it is performing the classification.[25] This
approach has been used to extract simple rules from machine-learning
models which were trained to classify reaction outcomes.[26] The SVM algorithm is trained as usual on the
training data set, and the resulting model is then used to obtain
the predicted labels for the training data. A conventional decision
tree classifier is then trained on the predicted labels to represent
the SVM predictive model in terms of a rule-based decision tree.
Descriptor
A new single value descriptor was developed
based on the distribution of relative conformer energies. The new
descriptor is a count of additional conformers (not including the
lowest energy conformer) with energies between selected relative energy
thresholds and is designed to approximate the number of energetically
accessible conformations of a molecule.In order to find the
optimal energy thresholds for the descriptor, a 5-fold cross-validation
was carried out on the training set using the descriptor to create
a single variable classifier. Figure shows the distribution of accuracies, which has a
broad maximum between an upper threshold of 16 to 20 kcal/mol and
a lower energy threshold of 0 to 1 kcal/mol, with no significant difference
between the predictive accuracies. This led to a choice of 0 as the
lower threshold and 20 as the upper threshold. An example of calculating
this descriptor using a 20 kcal/mol cutoff is given in Table .
Figure 1
Predictive accuracies for the conformer energy descriptor with
varying limits, as determined by 5-fold cross-validation.
Table 1
Example
nConf20 Calculation
for CSD Refcode TERLUXa
The
lowest energy conformer is
not counted, and conformer 41 is above the energy threshold, giving
an nConf20 value of 4.
The
lowest energy conformer is
not counted, and conformer 41 is above the energy threshold, giving
an nConf20 value of 4.Predictive accuracies for the conformer energy descriptor with
varying limits, as determined by 5-fold cross-validation.
Results and Discussion
Figure shows that
rotatable bond count and nConf20 capture similar but slightly
different information. There is a positive correlation of 0.75 between
the two features, but the spread of values of nConf20 for
each value of RBC is significantly different for those molecules observed
to crystallize compared to those which are not. Histograms of the
distribution of nConf20 values in each class are plotted
in Figure , and the
distribution of the RBC descriptor is shown in Figure . Those molecules which are not observed
to crystallize tend to have a larger value of nConf20 than
those with the same RBC which are observed to crystallize, indicating
that nConf20 provides better discrimination between the
two classes than RBC. Table shows an example molecule where RBC and the new descriptor
differ significantly in their estimation of the flexibility of the
molecule. Some rotatable bonds cause no change to the molecule, especially
when there is symmetry present, information which is captured by nConf20.
Figure 2
Boxplot of the distribution of nConf20 for each value
of rotatable bond count, split by class. The central line in the box
shows the median of nConf20 for that value of RBC. The
bottom and top of the box denote the 25th and 75th quartiles, respectively.
The whiskers extend to 1.5 times the interquartile range, and any
points outside this are plotted as outliers.
Figure 3
Histogram of nConf20 for each of the two classes.
Figure 4
Histogram of rotatable bond count for each of
the two classes.
Table 2
Example
Rotatable Bond Counts and
nConf20 Values
Boxplot of the distribution of nConf20 for each value
of rotatable bond count, split by class. The central line in the box
shows the median of nConf20 for that value of RBC. The
bottom and top of the box denote the 25th and 75th quartiles, respectively.
The whiskers extend to 1.5 times the interquartile range, and any
points outside this are plotted as outliers.Histogram of nConf20 for each of the two classes.Histogram of rotatable bond count for each of
the two classes.When nConf20 is used to make a single variable
classifier
of molecules observed and not observed to crystallize, the predictive
accuracy on the external validation set is 86.1%, 7.7 percentage points
better than any other single variable (Table ). The new descriptor therefore captures
more information than any other single 2D descriptor about the likelihood
of a molecule being observed to crystallize.
Table 3
Best-Performing
Single Variable Classifiers
by 5-Fold Cross-Validation and Prediction on an External Test Seta
descriptor
cross-validation
accuracy (%)
external
validation accuracy (%)
nConf20
85.9
86.1
valence electron count
78.1
78.4
Kappa1
78.3
78.3
Chi0n
78.3
77.9
SlogP VSA2
77.7
77.8
Kappa2
77.8
77.8
Kier molecular flexibility
index
77.3
77.6
path length flexibility
index
75.2
75.2
rotatable bond count
73.9
74.8
Existing flexibility
descriptors[27] are included for comparison.
Existing flexibility
descriptors[27] are included for comparison.nConf20 was then
combined with every other descriptor
in turn to create a set of two variable classifiers, and their accuracy
was assessed by cross-validation on the training set and prediction
on the external test set. In combination with the SMR VSA3 descriptor
it produces the best two-descriptor model with a predictive accuracy
of 89.2%, as shown in Figure . SMR VSA3 is a subdivided surface area descriptor which encodes
information about the van der Waals surface area of the molecule with
a molar refractivity in the range 0.26–0.35 and has a strong
positive correlation with the number of cyclic nitrogen atoms (0.84).
The heatmap shows that while the molecules which are not observed
to crystallize have a spread of values for both descriptors, the vast
majority of molecules observed to crystallize have a value of 0 for
both descriptors. This appears to imply that molecules with no additional
conformers and no cyclic nitrogen atoms are likely to be observed
to crystallize. The black dotted line denoting the SVM decision boundary
between the two classes shows an effective separation, and the predictive
accuracy is an increase of 4.4 percentage points on any other two-variable
classifier of crystallization propensity.
Figure 5
2D histogram of nConf20 against SMR VSA3 for all test
molecules color-coded by density of molecules. The dashed line shows
the boundary between the crystalline and noncrystalline regions as
predicted by the SVM algorithm using RBF kernel.
2D histogram of nConf20 against SMR VSA3 for all test
molecules color-coded by density of molecules. The dashed line shows
the boundary between the crystalline and noncrystalline regions as
predicted by the SVM algorithm using RBF kernel.When the algorithm was trained with nConf20 and
all
177 original descriptors, the predictive accuracy improves by only
0.1% to 96.1% relative to the model with the 177 descriptors without
nConf20, suggesting that this descriptor provides information
to the model that is already indirectly captured by the other original
descriptors. However, the new descriptor captures this flexibility
information more directly, as demonstrated by the high predictive
accuracy when used in a single variable classifier. This is important
for unpicking and understanding the decisions made by the machine
learning process and will also allow it to be used more easily in
linear machine learning classifiers and decision trees, which can
become very complicated if a combination of variables is required
to predict the output.The rule extraction analysis further
supports the high importance
of this flexibility descriptor in performing the classification, as
shown in Figure .
The first node in the tree (which mimics the labels provided by the
predictive model for the training data set) provides the best initial
split of the data and therefore indicates the most important classification
feature. In this case, nConf20 is the most important feature;
the decision tree shows that the best single–decision approximation
of the SVM can be obtained by assuming that the majority of molecules
with fewer than 6 low energy conformers are observed to crystallize,
while most of those above this cutoff are assumed to not be observed
to crystallize. This agrees with the distribution shown in the histograms
in Figure . The leaves
below this node show that a single nConf20 decision alone
reproduces the SVM predictive model with an accuracy of 92% on the
crystalline leaf and 83% on the noncrystalline leaf (an overall accuracy
of 87%).
Figure 6
Decision tree used for rule extraction (top 3 levels shown). The
gini coefficient is a measure of the impurity of the node. “Samples”
indicates the percentage of the total data set present at that node,
and “value” is the proportion of “not observed
to crystallize” (orange leaves) and “observed to crystallize”
(blue leaves) molecules at the node. Each node has been assigned an
overall class based on these proportions.
Decision tree used for rule extraction (top 3 levels shown). The
gini coefficient is a measure of the impurity of the node. “Samples”
indicates the percentage of the total data set present at that node,
and “value” is the proportion of “not observed
to crystallize” (orange leaves) and “observed to crystallize”
(blue leaves) molecules at the node. Each node has been assigned an
overall class based on these proportions.
Conclusions
We have created and optimized a new descriptor,
nConf20, which captures the conformational flexibility
of a particular molecule
based on its 2D chemical connectivity. The descriptor improves on
rotatable bond count by taking account of both molecular symmetry
and relative energies of conformations, in a manner that is correlated
with the crystallization propensity of the molecule. The descriptor
encodes relevant information about the 3D shape and flexibility of
a molecule from a 2D representation, without the need to consider
the interconversion energies, as we have assumed that all contributing
conformers are energetically accessible on the time scale of crystallization.
We have shown that, of the descriptors tested, this one is the most
relevant for predicting crystallization propensity of organic molecules,
using both a single variable classifier approach and rule extraction
analysis. The overall predictive accuracy of a full-descriptor model
including this descriptor is slightly increased, suggesting that the
descriptor captures similar information to the other descriptors in
a more direct manner; however, use of this descriptor in rule-based
classification methods will reduce the complexity of the resulting
model. This descriptor has the potential to be applied to other chemical
problems where flexibility is a key factor, such as QSAR studies or
the prediction of polymorphism. Further improvements could be made,
at the expense of computational speed, by incorporating information
from molecular dynamics calculations, to take account of solvent effects
which may influence how the conformers behave in solution.
Authors: David C Kombo; Kartik Tallapragada; Rachit Jain; Joseph Chewning; Anatoly A Mazurov; Jason D Speake; Terry A Hauser; Steve Toler Journal: J Chem Inf Model Date: 2013-01-18 Impact factor: 4.956
Authors: Paul Raccuglia; Katherine C Elbert; Philip D F Adler; Casey Falk; Malia B Wenny; Aurelio Mollo; Matthias Zeller; Sorelle A Friedler; Joshua Schrier; Alexander J Norquist Journal: Nature Date: 2016-05-05 Impact factor: 49.962
Authors: M Shaheer Malik; Basim H Asghar; Riyaz Syed; Reem I Alsantali; Moataz Morad; Hatem M Altass; Ziad Moussa; Ismail I Althagafi; Rabab S Jassas; Saleh A Ahmed Journal: Front Chem Date: 2021-05-24 Impact factor: 5.221
Authors: Aikaterini Vriza; Angelos B Canaj; Rebecca Vismara; Laurence J Kershaw Cook; Troy D Manning; Michael W Gaultois; Peter A Wood; Vitaliy Kurlin; Neil Berry; Matthew S Dyer; Matthew J Rosseinsky Journal: Chem Sci Date: 2020-12-08 Impact factor: 9.825