Xiaohui Qu, Diogo Ars Latino, Joao Aires-de-Sousa.
Abstract
BACKGROUND: Rapid access to intrinsic physicochemical properties of molecules is highly desirable for large-scale chemical data mining tasks such as mass spectrum prediction in metabolomics, toxicity risk assessment and drug discovery. Quantum chemistry calculations, e.g. by Density Functional Theory (DFT), are producing large volumes of data and provide increasingly accurate estimates of several properties, but they are still too computationally expensive for such large-scale uses. This work explores the possibility of using large amounts of data generated by DFT methods for thousands of molecular structures, extracting relevant molecular properties and applying machine learning (ML) algorithms to learn from the data. Once trained, these ML models can be applied to new structures to produce ultra-fast predictions. An approach is presented for homolytic bond dissociation energy (BDE).
Keywords: BDE; Big data; Bond dissociation energy; Chemoinformatics; DFT; DFTB; Machine learning; Neural network; Random forest
Year: 2013 PMID: 23849655 PMCID: PMC3720218 DOI: 10.1186/1758-2946-5-34
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Examples of bond descriptors.
Random Forest prediction of bond dissociation energies with different selections of descriptors obtained in the out-of-bag (OOB) validation procedure over the training set
| Connection Number | 3675 | 6.87 | 4.25 | 71.58 |
| Connection Number (only point descriptors) | 112 | 7.50 | 4.77 | 79.55 |
| Simple Element (Selection 1) | 615 | 8.90 | 5.41 | 82.53 |
| Selection of connection number, element, and no-type (Selection 2) | 293 | 7.01 | 4.35 | 56.87 |
a Results are in kcal/mol.
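The out-of-bag (OOB) procedure behind these results can be illustrated without a full random forest. The sketch below is a simplified stand-in, not the authors' implementation: it bags a hypothetical one-feature nearest-neighbour regressor (an assumption made to keep the example dependency-free, where a random forest would use decision trees) and predicts each row only with the models whose bootstrap sample excluded it. The function names and the synthetic data are illustrative.

```python
import math
import random


def nearest_neighbour_predict(train_rows, x):
    """Stand-in base learner: return the target of the closest training feature."""
    nearest = min(train_rows, key=lambda row: abs(row[0] - x))
    return nearest[1]


def oob_predictions(rows, n_models=50, seed=0):
    """Bagging with out-of-bag prediction.

    rows: list of (feature, target) pairs.
    Each model is fitted on a bootstrap sample of the rows; a row is predicted
    only by the models whose bootstrap sample did not contain it.
    Returns the mean OOB prediction per row (None if a row was never OOB).
    """
    rng = random.Random(seed)
    n = len(rows)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_models):
        sample_idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        in_bag = set(sample_idx)
        bag = [rows[i] for i in sample_idx]
        for i in range(n):
            if i not in in_bag:  # this model never saw row i
                sums[i] += nearest_neighbour_predict(bag, rows[i][0])
                counts[i] += 1
    return [sums[i] / counts[i] if counts[i] else None for i in range(n)]


# Synthetic data (not from the paper): target = 2 * feature.
rows = [(float(x), 2.0 * x) for x in range(20)]
oob = oob_predictions(rows)

# OOB error estimate over the rows that received at least one OOB prediction.
pairs = [(target, pred) for (_, target), pred in zip(rows, oob) if pred is not None]
rmse = math.sqrt(sum((pred - target) ** 2 for target, pred in pairs) / len(pairs))
```

Because every row is scored only by models that did not train on it, the resulting error behaves like a validation error even though no separate hold-out set is used.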
Impact of individual groups of descriptors on random forest prediction of bond dissociation energies
| Selection 2 b | 293 | 7.01 | 4.35 | 56.87 |
| Selection 2 - CN point | 209 | 9.06 | 5.50 | 80.97 |
| Selection 2 - Element pair | 218 | 7.18 | 4.48 | 57.13 |
| Selection 2 - Fragment point | 288 | 7.02 | 4.36 | 58.75 |
| Selection 2 - Aromatic fragment point | 279 | 7.07 | 4.39 | 62.14 |
| Selection 2 - In-ring fragment point | 281 | 7.03 | 4.36 | 55.31 |
| Selection 2 - No-type pair | 269 | 7.06 | 4.43 | 58.95 |
| Selection 2 - No-type bond-breaking difference pair | 284 | 6.96 | 4.32 | 57.70 |
| Selection 2 - π fragment point | 281 | 7.03 | 4.37 | 58.59 |
| Selection 2 - Molecular element pair | 263 | 6.96 | 4.31 | 57.86 |
| Selection 2 - Molecular CN fragment point | 265 | 7.04 | 4.35 | 57.77 |
| Selection 3 c | 209 | 7.00 | 4.32 | 58.36 |
a Results are in kcal/mol and were obtained in the out-of-bag (OOB) random forest validation procedure over the training set.
b Combination of the following descriptors: 1) Connection number point descriptors, 2) Simple element pair descriptors, 3) Fragment point descriptors (only the field with the lower value is used), 4) Aromatic fragment point descriptors with corresponding molecular descriptors, 5) In-ring fragment point descriptors, 6) No-type pair descriptors, 7) No-type bond-breaking difference pair descriptors, 8) Conjugated π system fragment point descriptors with corresponding molecular descriptors, 9) Simple element molecular pair descriptors, 10) Molecular connection number fragment point descriptors. Descriptors 1, 4 and 5 are calculated for spheres from 0 to 5. Descriptors 2, 3, 6 and 8 are calculated for spheres from 1 to 5. Descriptor 2 only involves pairs at a distance of one bond. Descriptor 7 involves pairs with interatomic distances from 2 to 7. Descriptor 9 involves pairs with interatomic distances of 1–2.
c Combination of Descriptors 1, 2, 4, 6, 8.
Random forest prediction of BDEs with descriptors encoding different spheres around the target bond
| 1 | 31 | 11.97 | 7.91 | 121.82 |
| 2 | 60 | 7.95 | 5.22 | 56.00 |
| 3 | 94 | 7.21 | 4.59 | 55.62 |
| 4 | 129 | 7.10 | 4.46 | 59.41 |
| 5 | 165 | 7.04 | 4.38 | 60.22 |
| 6 | 201 | 7.00 | 4.33 | 57.67 |
| 7 | 240 | 7.01 | 4.32 | 57.33 |
| 8 | 278 | 7.01 | 4.34 | 57.59 |
| 9 | 316 | 7.03 | 4.35 | 57.41 |
| 10 | 354 | 7.04 | 4.35 | 57.20 |
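One way to read "spheres around the target bond" is as layers of atoms at increasing topological distance from the two bonded atoms. The sketch below assumes that reading: sphere 0 holds the two atoms of the bond, which is an assumption rather than a convention stated in this record. It uses a plain adjacency dictionary as a hypothetical molecular graph, and the function name is illustrative.

```python
from collections import deque


def bond_spheres(adjacency, bond, max_sphere):
    """Group atoms by topological distance from the target bond.

    adjacency: dict mapping atom index -> list of neighbouring atom indices.
    bond: (i, j) pair of bonded atom indices; both form sphere 0 (assumption).
    Returns a list whose entry k holds the sorted atoms that are k bonds away
    from the nearer of the two bond atoms, for k = 0..max_sphere.
    """
    i, j = bond
    distance = {i: 0, j: 0}
    queue = deque([i, j])  # breadth-first search started from both bond atoms
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_sphere:
            continue  # do not expand beyond the outermost requested sphere
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    spheres = [[] for _ in range(max_sphere + 1)]
    for atom, d in distance.items():
        spheres[d].append(atom)
    return [sorted(s) for s in spheres]


# Hypothetical heavy-atom chain C0-C1-C2-O3 with the C1-C2 bond as target.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
spheres = bond_spheres(adjacency, (1, 2), max_sphere=2)
```

Descriptors restricted to the first k spheres would then only see the atoms in `spheres[0]` through `spheres[k]`, which is why the descriptor count grows with the number of spheres encoded.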
Descriptors selected by random forest to predict bond dissociation energies
| CN point desc. | 56 | 39 |
| Element pair desc. | 45 | 23 |
| Aromatic fragment point desc. | 10 | 10 |
| No-type pair desc. | 15 | 15 |
| π Total number desc. | 3 | 3 |
Figure 2. Distribution of ASNN BDE errors in the test set.
Accuracy of dissociation energies predicted by ASNN, RF and calculated by PM6 (against B3LYP-calculated BDEs) for different types of bonds
| C-C | 21.38 | 49.57 | 19.27 | 0.806 | 6.64 | 23.83 | 4.69 | 0.907 | 5.47 | 22.95 | 3.72 | 0.938 |
| C-H | 18.06 | 35.48 | 17.15 | 0.869 | 3.77 | 19.79 | 2.34 | 0.942 | 3.68 | 22.64 | 2.20 | 0.944 |
| C-N | 17.30 | 39.60 | 15.10 | 0.919 | 8.52 | 26.15 | 5.87 | 0.928 | 6.70 | 17.33 | 5.02 | 0.955 |
| C-O | 13.97 | 24.54 | 12.45 | 0.969 | 6.73 | 21.10 | 4.96 | 0.971 | 5.42 | 16.67 | 3.87 | 0.979 |
| C-S | 7.78 | 14.83 | 6.42 | 0.881 | 5.99 | 14.30 | 4.53 | 0.887 | 4.29 | 10.46 | 3.20 | 0.950 |
| N-H | 12.04 | 20.45 | 11.03 | 0.799 | 8.39 | 37.07 | 5.47 | 0.576 | 8.31 | 34.99 | 5.19 | 0.586 |
| O-H | 10.73 | 15.45 | 9.58 | 0.975 | 12.35 | 24.78 | 9.02 | 0.704 | 12.46 | 22.54 | 9.53 | 0.586 |
| N-N | 13.06 | 22.99 | 10.21 | 0.779 | 13.03 | 21.95 | 12.14 | 0.514 | 10.10 | 15.12 | 8.47 | 0.736 |
| N-O | 10.51 | 18.85 | 8.72 | 0.635 | 5.01 | 9.11 | 3.89 | 0.912 | 7.58 | 17.63 | 5.07 | 0.827 |
| S-O | 25.90 | 26.04 | 25.90 | 1.000 | 0.69 | 0.78 | 0.69 | 1.000 | 1.94 | 2.01 | 1.94 | 1.000 |
a 787 bonds were used for the comparison, from the molecules of the test set that could be calculated with PM6.
b Each line corresponds to a type of bond between atoms of specific elements (all bond orders included).
Figure 3. ASNN and PM6 predictions of BDEs versus B3LYP calculations.
Comparison between BDEs predicted by ASNN and BDEs predicted as the average of BDEs for the bonds of the same type in the training set
| | |||||||
|---|---|---|---|---|---|---|---|
| C-C | 21.64 | 62.88 | 16.56 | 5.47 | 22.95 | 3.72 | 0.938 |
| C-H | 15.48 | 58.64 | 12.94 | 3.68 | 22.64 | 2.20 | 0.944 |
| C-N | 20.57 | 65.28 | 15.65 | 6.70 | 17.33 | 5.02 | 0.955 |
| C-O | 14.18 | 46.63 | 10.87 | 5.42 | 16.67 | 3.87 | 0.979 |
| C-S | 11.03 | 23.36 | 8.85 | 4.29 | 10.46 | 3.20 | 0.950 |
| H-N | 13.20 | 33.17 | 10.17 | 8.31 | 34.99 | 5.19 | 0.586 |
| H-O | 22.65 | 44.56 | 15.62 | 12.46 | 22.54 | 9.53 | 0.586 |
| N-N | 21.81 | 42.80 | 18.12 | 10.10 | 15.12 | 8.47 | 0.736 |
| N-O | 8.65 | 15.21 | 7.44 | 7.58 | 17.63 | 5.07 | 0.827 |
| O-S | 1.85 | 2.55 | 1.58 | 1.94 | 2.01 | 1.94 | 1.000 |
a Each line corresponds to a type of bond between atoms of specific elements and of a specific order.
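The baseline in this comparison, predicting each BDE as the training-set average for bonds of the same type, is simple enough to sketch directly. The code below is a minimal illustration, assuming a bond type is keyed by the two elements plus the bond order, as the footnote suggests. The function names and the numeric values are made up for the example and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def bond_type(elem_a, elem_b, order):
    """Canonical key: elements are sorted so C-H and H-C share one group."""
    first, second = sorted((elem_a, elem_b))
    return (first, second, order)


def fit_type_means(training_bonds):
    """Map each bond type to the mean BDE observed for it in the training set.

    training_bonds: iterable of (bond_type_key, bde) pairs.
    """
    grouped = defaultdict(list)
    for key, bde in training_bonds:
        grouped[key].append(bde)
    return {key: mean(values) for key, values in grouped.items()}


def predict_baseline(type_means, keys):
    """Predict each bond's BDE as its type mean; None for unseen bond types."""
    return [type_means.get(key) for key in keys]


# Illustrative values in kcal/mol (not from the paper).
train = [
    (bond_type("C", "H", 1), 98.0),
    (bond_type("H", "C", 1), 102.0),
    (bond_type("C", "C", 1), 85.0),
]
means = fit_type_means(train)
predictions = predict_baseline(
    means, [bond_type("C", "H", 1), bond_type("N", "N", 1)]
)
```

A baseline like this ignores the environment of the bond entirely, so any improvement of a trained model over it can be attributed to the descriptors of the surrounding atoms.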
Correlations between BDEs calculated at different levels of theory and estimations by PM6 and ASNN
| | ||||||||
|---|---|---|---|---|---|---|---|---|
| B3LYP/6-311++G(d,p)//DFTB | 3.04 | 1.82 | 21.41 | 0.985 | | | | |
| PM6 | 16.88 | 15.46 | 46.03 | 0.901 | 17.52 | 15.98 | 49.57 | 0.890 |
| ASNN | 5.18 | 3.38 | 33.58 | 0.956 | 5.16 | 3.21 | 34.99 | 0.954 |
a 787 bonds were used for the comparison, from the molecules of the test set that could be calculated with all DFT and PM6 methods.
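A comparison of this kind reduces to a few summary statistics over paired values. The sketch below assumes the statistics of interest are the root mean square error, the mean absolute error, the maximum absolute error and the Pearson correlation coefficient. That choice is an assumption, and the function name and the sample values are illustrative rather than taken from the paper.

```python
import math


def error_summary(reference, predicted):
    """Summarise agreement between two equally long lists of BDEs (kcal/mol)."""
    n = len(reference)
    errors = [p - r for r, p in zip(reference, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    max_abs = max(abs(e) for e in errors)

    # Pearson correlation coefficient computed from centred values.
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    pearson_r = cov / math.sqrt(var_r * var_p)

    return {"rmse": rmse, "mae": mae, "max_abs": max_abs, "r": pearson_r}


# Made-up reference and predicted BDEs in kcal/mol.
reference = [80.0, 90.0, 100.0, 110.0]
predicted = [82.0, 89.0, 103.0, 110.0]
summary = error_summary(reference, predicted)
```

Note that the correlation coefficient is undefined when either list is constant, and that it is trivially 1 or -1 for any two distinct points, so it carries little information for bond types with very few examples.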
Figure 4. Example of a symmetrical molecule with different BDEs for two topologically equivalent bonds due to hydrogen bonding.
Figure 5. Examples of two bonds with identical descriptors in the final model, but very different BDEs due to remote group effects.