Literature DB >> 28024401

Beyond Rotatable Bond Counts: Capturing 3D Conformational Flexibility in a Single Descriptor.

Abstract

A new molecular descriptor, nConf20, based on chemical connectivity, is presented which captures the accessible conformational space of a molecule. Currently the best available two-dimensional descriptors for quantifying the flexibility of a particular molecule are the rotatable bond count (RBC) and the Kier flexibility index. We present a descriptor which captures this information by sampling the conformational space of a molecule using the RDKit conformer generator. Flexibility has previously been identified as a key feature in determining whether a molecule is likely to crystallize or not. For this application, nConf20 significantly outperforms previously reported single-variable classifiers and also assists rule-based analysis of black-box machine learning classification algorithms.

Entities: Chemical Disease Gene

Mesh：

Year: 2016 PMID： 28024401 PMCID： PMC5271572 DOI： 10.1021/acs.jcim.6b00565

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

One of the key steps in approaching a cheminformatics problem is the definition of the “chemical space” used to describe the problem.[1] The set of numerical descriptors chosen to capture the characteristics of molecules defines the basis vectors of this space, with n linearly independent descriptors giving rise to an n-dimensional space, in which the coordinates of a particular molecule are given by the values of the descriptors for that molecule. Careful selection of these descriptors provides a useful chemical space for data visualization, similarity measures, and classification or clustering algorithms. Descriptors can be broadly categorized according to their “dimensionality”, based on the type of molecular representation used to calculate them.[2] Zero-dimensional descriptors can be calculated directly from the molecular formula, e.g. molecular weight, while one-dimensional descriptors are bulk properties of the molecule, e.g. calculated solubility.[3] Two-dimensional descriptors, such as connectivity indices and properties of the molecular bond graph, are calculated from a traditional two-dimensional representation of the molecule. Three-dimensional descriptors are computed from a known conformation of a molecule and capture features of a molecule such as shape, distribution of charges, and radius of gyration.[4] It has previously been shown that increased flexibility can reduce the crystallization tendency of a molecule[5] and that rotatable bond count (RBC), a 2D descriptor based on a set of SMARTS pattern matching rules, was an important feature of molecules for the determination of how easily a molecular material can be crystallized.[6] This may be due to the reduced effective concentration of the “correct” crystallizing conformer in solution for a molecule with more rotatable bonds.[5] However, the exact mechanisms of nucleation and growth of crystals and the influence of molecular conformation are still not fully understood.[7,8] RBC is quite a crude approximation of molecular flexibility; Kier devised a way of encoding this attribute based on the chemical graph, but this uses descriptors which are also based solely on two-dimensional information.[9] Other studies of conformational flexibility have been computationally expensive because they attempt to evaluate the entire potential energy hypersurface,[10,11] which is impractical for a large number of molecules. They also either do not yield a single value which can be correlated with a physical property of the molecule or are only appropriate for use with a specific subset of chemical space, such as alkanes.[12,13] A more direct measure could be obtained by designing a descriptor based on a sampling of the energetically accessible conformers of a given molecule. This approach is relatively quick to compute as it only involves finding the minima of the potential energy surface, and it falls into the 2D category of descriptors outlined above since its value depends only on the chemical connectivity of a molecule but will capture 3D information about the number of low energy conformers of that molecule. This eliminates the computation of the barriers to interconversion and assumes that all conformers are energetically accessible on the time scales of crystallization. Implementation of this new “3D from 2D” descriptor and its application to the problem of predicting if a molecule will be observed to crystallize is described herein.

Methods

Conformer Generation

Molecules were provided to the conformer-generation step as SMILES strings to ensure no residual conformational information was retained, and explicit hydrogen atoms were added to the skeleton. RDKit cheminformatics toolkit[14] functions were used to generate 50 random molecular conformations, while retaining the starting stereochemistry. RDKit was chosen over other open-source conformer generation tools like BALLOON, CONFAB, and FROG2 and commercial platforms such as MOE, due to speed and the ability to generate conformers which are structurally similar to experimentally determined structures.[15] A knowledge-based conformer generator which uses experimental observations of torsional angle distributions is available in the latest release of the Cambridge Structural Database (CSD) tools.[16] These alternatives have not been explored in this work but could potentially be used to sample conformational space in a similar manner to RDKit. Each randomly generated conformer was optimized using the Merck Molecular Force Field (MMFF94).[17] MMFF94 is a general purpose parametrized force field comprised of several well-defined contributions to the total potential energy of a molecule, including bond stretching energy, bond torsion, and electrostatic and van der Waals interaction energies. The force field parametrization is determined by training on a large set of computational data derived from ab initio calculations on a diverse range of organic and bioorganic structures and has been implemented within the RDKit.[18] Some other force fields suitable for organic molecules include Amber, Gaff, and CHARMM.[19] The Universal Force Field (UFF) can be used to compute energies and gradients of molecules containing almost any element and may therefore prove useful if extending this work to metal–organic complexes or inorganic molecular materials. MMFF94 has been shown to reproduce gas-phase conformer energies more accurately than these other widely available force fields[20] and was chosen for its significantly shorter computational time compared to a more accurate molecular dynamics calculation including solvent effects. If the optimization did not converge to a stable minimum the conformer was removed. The force field is then used to calculate the energy of each conformer; its energy relative to the lowest-energy conformer found is stored. The lowest energy conformer is retained, and for each other conformer the alignment of all permutations of matching atom orders with the other conformers is checked, to account for symmetry. Any duplicate conformers with a heavy atom root-mean-square (RMS) distance of less than 1.0 Å to any other conformer are removed. For the small minority (0.05%) of molecules where the MMFF optimization failed, the molecule was removed from the study. The entire calculation of the energies takes around 0.2 s for molecules with fewer than two rotatable bonds, 1–2 s for molecules with 4 or 5 rotatable bonds, and up to 5 s for molecules with 8 rotatable bonds.

Predictive Model

CSD molecules were obtained from crystal structures in the 2016 CSD release (version 5.37), while lists of commercially available molecules were obtained from ZINC15 downloaded in May 2016. 177 molecular descriptors were calculated using the RDKit cheminformatics toolkit,[14] version Q1 2016. Machine learning algorithms and performance metrics were implemented using version 17.0 of the scikit-learn package.[21] The descriptor definitions and an example of the method used to train a model and output a predictive accuracy from a set of training and test molecules with known labels are given in the Supporting Information of Wicker and Cooper (see http://www.rsc.org/suppdata/ce/c4/c4ce01912a/c4ce01912a1.pdf).[6] Training and test molecules were selected as in a previous study[6] using information extracted from the ZINC[22] database and the CSD.[16] In this instance, no drug-like filter was applied, to include all organic molecules, which resulted in a set of 48112 commercially available molecules of which 36083 were used for training and 12029 were reserved for a test set. Half of each set consisted of “observed to crystallize” molecules (found in both ZINC and the CSD) and the other half consisted of “not observed to crystallize” molecules (found only in ZINC). Support vector machines (SVMs) were used as the machine learning algorithm to create the predictive model using the molecular descriptors, having previously been found to give the best performance for this classification problem.[6]

Rule Extraction

The “black-box” nature of the nonlinear SVM predictive model prevents direct determination of the most important descriptors used in performing the classification.[23] Two methods were used to identify these descriptors.

Single Variable Classifiers

The descriptors which are calculated by RDKit were each used in turn to create a single variable classifier built in order to find the descriptor which gave the best predictive accuracy and therefore the most effective classification.[24] The accuracy was assessed both by 5-fold cross-validation on the training set and by prediction on an external validation set. This approach can be extended to two (or more) variable classifiers.

Decision Tree

Rule extraction techniques can be used to mimic the SVM model as closely as possible in order to infer how it is performing the classification.[25] This approach has been used to extract simple rules from machine-learning models which were trained to classify reaction outcomes.[26] The SVM algorithm is trained as usual on the training data set, and the resulting model is then used to obtain the predicted labels for the training data. A conventional decision tree classifier is then trained on the predicted labels to represent the SVM predictive model in terms of a rule-based decision tree.

Descriptor

A new single value descriptor was developed based on the distribution of relative conformer energies. The new descriptor is a count of additional conformers (not including the lowest energy conformer) with energies between selected relative energy thresholds and is designed to approximate the number of energetically accessible conformations of a molecule. In order to find the optimal energy thresholds for the descriptor, a 5-fold cross-validation was carried out on the training set using the descriptor to create a single variable classifier. Figure shows the distribution of accuracies, which has a broad maximum between an upper threshold of 16 to 20 kcal/mol and a lower energy threshold of 0 to 1 kcal/mol, with no significant difference between the predictive accuracies. This led to a choice of 0 as the lower threshold and 20 as the upper threshold. An example of calculating this descriptor using a 20 kcal/mol cutoff is given in Table .

Figure 1

Predictive accuracies for the conformer energy descriptor with varying limits, as determined by 5-fold cross-validation.

Table 1

Example nConf20 Calculation for CSD Refcode TERLUXa

The lowest energy conformer is not counted, and conformer 41 is above the energy threshold, giving an nConf20 value of 4.

The lowest energy conformer is not counted, and conformer 41 is above the energy threshold, giving an nConf20 value of 4. Predictive accuracies for the conformer energy descriptor with varying limits, as determined by 5-fold cross-validation.

Results and Discussion

Figure shows that rotatable bond count and nConf20 capture similar but slightly different information. There is a positive correlation of 0.75 between the two features, but the spread of values of nConf20 for each value of RBC is significantly different for those molecules observed to crystallize compared to those which are not. Histograms of the distribution of nConf20 values in each class are plotted in Figure , and the distribution of the RBC descriptor is shown in Figure . Those molecules which are not observed to crystallize tend to have a larger value of nConf20 than those with the same RBC which are observed to crystallize, indicating that nConf20 provides better discrimination between the two classes than RBC. Table shows an example molecule where RBC and the new descriptor differ significantly in their estimation of the flexibility of the molecule. Some rotatable bonds cause no change to the molecule, especially when there is symmetry present, information which is captured by nConf20.

Figure 2

Figure 3

Histogram of nConf20 for each of the two classes.

Figure 4

Histogram of rotatable bond count for each of the two classes.

Table 2

Example Rotatable Bond Counts and nConf20 Values

Boxplot of the distribution of nConf20 for each value of rotatable bond count, split by class. The central line in the box shows the median of nConf20 for that value of RBC. The bottom and top of the box denote the 25th and 75th quartiles, respectively. The whiskers extend to 1.5 times the interquartile range, and any points outside this are plotted as outliers. Histogram of nConf20 for each of the two classes. Histogram of rotatable bond count for each of the two classes. When nConf20 is used to make a single variable classifier of molecules observed and not observed to crystallize, the predictive accuracy on the external validation set is 86.1%, 7.7 percentage points better than any other single variable (Table ). The new descriptor therefore captures more information than any other single 2D descriptor about the likelihood of a molecule being observed to crystallize.

Table 3

Best-Performing Single Variable Classifiers by 5-Fold Cross-Validation and Prediction on an External Test Seta

descriptor	cross-validation accuracy (%)	external validation accuracy (%)
nConf₂₀	85.9	86.1
valence electron count	78.1	78.4
Kappa1	78.3	78.3
Chi0n	78.3	77.9
SlogP VSA2	77.7	77.8
Kappa2	77.8	77.8
Kier molecular flexibility index	77.3	77.6
path length flexibility index	75.2	75.2
rotatable bond count	73.9	74.8

Existing flexibility descriptors[27] are included for comparison.

Existing flexibility descriptors[27] are included for comparison. nConf20 was then combined with every other descriptor in turn to create a set of two variable classifiers, and their accuracy was assessed by cross-validation on the training set and prediction on the external test set. In combination with the SMR VSA3 descriptor it produces the best two-descriptor model with a predictive accuracy of 89.2%, as shown in Figure . SMR VSA3 is a subdivided surface area descriptor which encodes information about the van der Waals surface area of the molecule with a molar refractivity in the range 0.26–0.35 and has a strong positive correlation with the number of cyclic nitrogen atoms (0.84). The heatmap shows that while the molecules which are not observed to crystallize have a spread of values for both descriptors, the vast majority of molecules observed to crystallize have a value of 0 for both descriptors. This appears to imply that molecules with no additional conformers and no cyclic nitrogen atoms are likely to be observed to crystallize. The black dotted line denoting the SVM decision boundary between the two classes shows an effective separation, and the predictive accuracy is an increase of 4.4 percentage points on any other two-variable classifier of crystallization propensity.

Figure 5

2D histogram of nConf20 against SMR VSA3 for all test molecules color-coded by density of molecules. The dashed line shows the boundary between the crystalline and noncrystalline regions as predicted by the SVM algorithm using RBF kernel. When the algorithm was trained with nConf20 and all 177 original descriptors, the predictive accuracy improves by only 0.1% to 96.1% relative to the model with the 177 descriptors without nConf20, suggesting that this descriptor provides information to the model that is already indirectly captured by the other original descriptors. However, the new descriptor captures this flexibility information more directly, as demonstrated by the high predictive accuracy when used in a single variable classifier. This is important for unpicking and understanding the decisions made by the machine learning process and will also allow it to be used more easily in linear machine learning classifiers and decision trees, which can become very complicated if a combination of variables is required to predict the output. The rule extraction analysis further supports the high importance of this flexibility descriptor in performing the classification, as shown in Figure . The first node in the tree (which mimics the labels provided by the predictive model for the training data set) provides the best initial split of the data and therefore indicates the most important classification feature. In this case, nConf20 is the most important feature; the decision tree shows that the best single–decision approximation of the SVM can be obtained by assuming that the majority of molecules with fewer than 6 low energy conformers are observed to crystallize, while most of those above this cutoff are assumed to not be observed to crystallize. This agrees with the distribution shown in the histograms in Figure . The leaves below this node show that a single nConf20 decision alone reproduces the SVM predictive model with an accuracy of 92% on the crystalline leaf and 83% on the noncrystalline leaf (an overall accuracy of 87%).

Figure 6

Decision tree used for rule extraction (top 3 levels shown). The gini coefficient is a measure of the impurity of the node. “Samples” indicates the percentage of the total data set present at that node, and “value” is the proportion of “not observed to crystallize” (orange leaves) and “observed to crystallize” (blue leaves) molecules at the node. Each node has been assigned an overall class based on these proportions.

Conclusions

We have created and optimized a new descriptor, nConf20, which captures the conformational flexibility of a particular molecule based on its 2D chemical connectivity. The descriptor improves on rotatable bond count by taking account of both molecular symmetry and relative energies of conformations, in a manner that is correlated with the crystallization propensity of the molecule. The descriptor encodes relevant information about the 3D shape and flexibility of a molecule from a 2D representation, without the need to consider the interconversion energies, as we have assumed that all contributing conformers are energetically accessible on the time scale of crystallization. We have shown that, of the descriptors tested, this one is the most relevant for predicting crystallization propensity of organic molecules, using both a single variable classifier approach and rule extraction analysis. The overall predictive accuracy of a full-descriptor model including this descriptor is slightly increased, suggesting that the descriptor captures similar information to the other descriptors in a more direct manner; however, use of this descriptor in rule-based classification methods will reduce the complexity of the resulting model. This descriptor has the potential to be applied to other chemical problems where flexibility is a key factor, such as QSAR studies or the prediction of polymorphism. Further improvements could be made, at the expense of computational speed, by incorporating information from molecular dynamics calculations, to take account of solvent effects which may influence how the conformers behave in solution.

10 in total

1. Selected concepts and investigations in compound classification, molecular descriptor analysis, and virtual screening.

Authors: J Bajorath
Journal: J Chem Inf Comput Sci Date: 2001 Mar-Apr

2. Freely available conformer generation methods: how good are they?

Authors: Jean-Paul Ebejer; Garrett M Morris; Charlotte M Deane
Journal: J Chem Inf Model Date: 2012-04-19 Impact factor: 4.956

Review 3. Exploring chemical space: recent advances in chemistry.

Authors: Yung-Sing Wong
Journal: Methods Mol Biol Date: 2012

4. Development of a chirality-sensitive flexibility descriptor for 3+3D-QSAR.

Authors: Maté Dervarics; Ferenc Otvös; Tamas A Martinek
Journal: J Chem Inf Model Date: 2006 May-Jun Impact factor: 4.956

5. Conformational polymorphism.

Authors: Aurora J Cruz-Cabeza; Joel Bernstein
Journal: Chem Rev Date: 2013-12-18 Impact factor: 60.622

Review 6. Travelling through conformational space: an approach for analyzing the conformational behaviour of flexible molecules.

Authors: J Koca
Journal: Prog Biophys Mol Biol Date: 1998 Impact factor: 3.667

7. 3D molecular descriptors important for clinical success.

Authors: David C Kombo; Kartik Tallapragada; Rachit Jain; Joseph Chewning; Anatoly A Mazurov; Jason D Speake; Terry A Hauser; Steve Toler
Journal: J Chem Inf Model Date: 2013-01-18 Impact factor: 4.956

8. Machine-learning-assisted materials discovery using failed experiments.

Authors: Paul Raccuglia; Katherine C Elbert; Philip D F Adler; Casey Falk; Malia B Wenny; Aurelio Mollo; Matthias Zeller; Sorelle A Friedler; Joshua Schrier; Alexander J Norquist
Journal: Nature Date: 2016-05-05 Impact factor: 49.962

9. ZINC 15--Ligand Discovery for Everyone.

Authors: Teague Sterling; John J Irwin
Journal: J Chem Inf Model Date: 2015-11-09 Impact factor: 4.956

10. The Cambridge Structural Database.

Authors: Colin R Groom; Ian J Bruno; Matthew P Lightfoot; Suzanna C Ward
Journal: Acta Crystallogr B Struct Sci Cryst Eng Mater Date: 2016-04-01

10 in total

5 in total

1. Privileged substructures for anti-sickling activity via cheminformatic analysis.

Authors: Chuleeporn Phanus-Umporn; Watshara Shoombuatong; Veda Prachayasittikul; Nuttapat Anuwongcharoen; Chanin Nantasenamat
Journal: RSC Adv Date: 2018-02-07 Impact factor: 4.036

2. Novel Pyran-Linked Phthalazinone-Pyrazole Hybrids: Synthesis, Cytotoxicity Evaluation, Molecular Modeling, and Descriptor Studies.

Authors: M Shaheer Malik; Basim H Asghar; Riyaz Syed; Reem I Alsantali; Moataz Morad; Hatem M Altass; Ziad Moussa; Ismail I Althagafi; Rabab S Jassas; Saleh A Ahmed
Journal: Front Chem Date: 2021-05-24 Impact factor: 5.221

3. In Silico Study Approach on a Series of 50 Polyphenolic Compounds in Plants; A Comparison on the Bioavailability and Bioactivity Data.

Authors: Amalia Stefaniu; Lucia Camelia Pirvu
Journal: Molecules Date: 2022-02-19 Impact factor: 4.411

4. One class classification as a practical approach for accelerating π-π co-crystal discovery.

Authors: Aikaterini Vriza; Angelos B Canaj; Rebecca Vismara; Laurence J Kershaw Cook; Troy D Manning; Michael W Gaultois; Peter A Wood; Vitaliy Kurlin; Neil Berry; Matthew S Dyer; Matthew J Rosseinsky
Journal: Chem Sci Date: 2020-12-08 Impact factor: 9.825

5. Discovering Relationships between OSDAs and Zeolites through Data Mining and Generative Neural Networks.

Authors: Zach Jensen; Soonhyoung Kwon; Daniel Schwalbe-Koda; Cecilia Paris; Rafael Gómez-Bombarelli; Yuriy Román-Leshkov; Avelino Corma; Manuel Moliner; Elsa A Olivetti
Journal: ACS Cent Sci Date: 2021-04-16 Impact factor: 14.553

5 in total