Daniel C Elton, Zois Boukouvalas, Mark S Butrico, Mark D Fuge, Peter W Chung.
Abstract
We present a proof of concept that machine learning techniques can be used to predict the properties of CNOHF energetic molecules from their molecular structures. We focus on a small but diverse dataset consisting of 109 molecular structures spread across ten compound classes. Up until now, candidate molecules for energetic materials have been screened using predictions from expensive quantum simulations and thermochemical codes. We present a comprehensive comparison of machine learning models and several molecular featurization methods: sum over bonds, custom descriptors, Coulomb matrices, Bag of Bonds, and fingerprints. The best featurization was sum over bonds (bond counting), and the best model was kernel ridge regression. Despite having a small dataset, we obtain acceptable errors and Pearson correlations for the out-of-sample prediction of detonation pressure, detonation velocity, explosive energy, heat of formation, density, and other properties. By including another dataset with ≈300 additional molecules in our training we show how the error can be pushed lower, although the convergence with number of molecules is slow. Our work paves the way for future applications of machine learning in this domain, including automated lead generation and interpreting machine learning models to obtain novel chemical insights.
Year: 2018 PMID: 29899464 PMCID: PMC5998124 DOI: 10.1038/s41598-018-27344-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Transformation of (x, y, z) coordinates and nuclear charges to the Coulomb matrix eigenvalue spectra representation of a molecule.
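The transformation in Figure 1 can be sketched as follows. This is a minimal illustration and not the authors' code. It assumes the widely used Coulomb matrix definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j/|R_i − R_j|), sorts eigenvalues by decreasing absolute value, and zero-pads to a fixed length; the padding length, the function names, and the approximate water geometry are assumptions made for the example.

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Coulomb matrix from nuclear charges Z and (x, y, z) positions.

    Diagonal:     0.5 * Z_i ** 2.4
    Off-diagonal: Z_i * Z_j / |R_i - R_j|
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise distances between all atoms
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder, diagonal is overwritten below
    cm = np.outer(z, z) / dist
    np.fill_diagonal(cm, 0.5 * z ** 2.4)
    return cm

def eigenvalue_spectrum(charges, coords, size):
    """Eigenvalues sorted by decreasing magnitude, zero-padded to `size`."""
    eigs = np.linalg.eigvalsh(coulomb_matrix(charges, coords))
    eigs = eigs[np.argsort(-np.abs(eigs))]
    return np.pad(eigs, (0, size - len(eigs)))

# Approximate, illustrative water geometry (assumed values)
water_z = [8, 1, 1]
water_xyz = [[0.000, 0.000, 0.117],
             [0.000, 0.757, -0.469],
             [0.000, -0.757, -0.469]]
spectrum = eigenvalue_spectrum(water_z, water_xyz, size=5)
print(spectrum)
```

Padding to a common length lets molecules with different atom counts share one fixed-size feature vector, and the eigenvalue spectrum does not depend on the order in which atoms are listed.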
Figure 2. Mean errors in explosive energy obtained with different fingerprinting methods at different fingerprint lengths, using Bayesian ridge regression and leave-5-out cross validation. The E-state fingerprint has a fixed length and therefore appears as a flat line.
Detailed comparison of 13 different featurization schemes for prediction of explosive energy with kernel ridge regression, ranked by MAEtest. The variance in MAEs between folds was less than 0.01 in all cases. Hyperparameter optimization was used throughout with nested 5-fold cross validation. The metrics are averaged over 20 train-test sets using shuffle split with 80/20 splitting.
| name | MAEtrain | MAEtest | MAPEtest | | | rtrain | rtest |
|---|---|---|---|---|---|---|---|
| E-state + CDS + SoB | 0.244 | 0.334 | 8.93 | 0.88 | 0.76 | 0.88 | 0.79 |
| CDS + SoB | 0.247 | 0.335 | 9.32 | 0.88 | 0.75 | 0.88 | 0.79 |
| E-state + custom descriptor set | 0.224 | 0.345 | 9.50 | 0.89 | 0.75 | 0.90 | 0.79 |
| SoB + OB100 | 0.256 | 0.358 | 10.50 | 0.87 | 0.61 | 0.87 | 0.70 |
| sum over bonds (SoB) | 0.280 | 0.379 | 10.69 | 0.84 | 0.67 | 0.84 | 0.71 |
| truncated E-state | 0.260 | 0.414 | 12.65 | 0.85 | 0.66 | 0.85 | 0.70 |
| custom descriptor set (CDS) | 0.398 | 0.432 | 12.92 | 0.68 | 0.57 | 0.68 | 0.63 |
| Bag of Bonds (BoB) | 0.213 | 0.467 | 12.60 | 0.89 | 0.54 | 0.90 | 0.60 |
| Oxygen balance1600 | 0.419 | 0.489 | 15.66 | 0.67 | 0.41 | 0.68 | 0.56 |
| Summed Bag of Bonds | 0.262 | 0.493 | 13.63 | 0.85 | 0.18 | 0.85 | 0.56 |
| Coulomb matrix eigenvalues | 0.314 | 0.536 | 15.73 | 0.81 | 0.37 | 0.82 | 0.48 |
| Oxygen balance100 | 0.444 | 0.543 | 17.46 | 0.59 | 0.44 | 0.62 | 0.57 |
| Coulomb matrices as vec | 0.395 | 0.672 | 21.86 | 0.57 | 0.05 | 0.67 | 0.20 |
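For readers unfamiliar with the column names, the sketch below shows one conventional way to compute a mean absolute error (MAE), a mean absolute percentage error (MAPE), and a Pearson correlation coefficient (r) from true and predicted values. The function names and the four sample values are made up for illustration and are not taken from the dataset.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Hypothetical true and predicted explosive energies
y_true = [3.0, 4.0, 5.0, 4.0]
y_pred = [3.5, 3.5, 4.5, 4.5]
print(mae(y_true, y_pred), mape(y_true, y_pred), pearson_r(y_true, y_pred))
```

MAE keeps the units of the target, MAPE expresses the error relative to each true value, and r measures only how linearly predictions track the truth, so a model can have a high r and still be biased.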
Average mean absolute errors (MAEs) in the test sets for different combinations of target property, model and featurization. Hyperparameter optimization was used throughout with nested 5-fold cross validation. The test MAEs are averaged over 20 test sets using shuffle split with 80/20 splitting. The properties are density, heat of formation of the solid, explosive energy, shock velocity, particle velocity, sound velocity, detonation pressure, detonation temperature, and TNT equivalent per cubic centimeter. The models are kernel ridge regression (KRR), ridge regression (Ridge), support vector regression (SVR), random forest (RF), k-nearest neighbors (kNN), and a take-the-mean dummy predictor. The last row gives the average value for each property in the dataset.
| model | featurization | density | heat of formation (solid) | explosive energy | shock velocity | particle velocity | sound velocity | detonation pressure | detonation temperature | TNT equivalent per cc |
|---|---|---|---|---|---|---|---|---|---|---|
| KRR | Estate | 0.10 | 261.02 | 0.63 | 0.48 | 0.13 | 0.41 | 4.95 | 500.19 | 0.18 |
| | CDS | 0.08 | 198.81 | 0.50 | 0.44 | 0.11 | 0.37 | 3.07 | 462.63 | 0.17 |
| | SoB | 0.07 | 68.73 | 0.40 | | | | 2.90 | 331.36 | |
| | CM eigs | 0.09 | 288.41 | 0.67 | 0.67 | 0.18 | 0.61 | 5.67 | 600.08 | 0.22 |
| | Bag of Bonds | | 166.66 | 0.47 | 0.33 | 0.11 | 0.29 | 3.38 | 478.93 | 0.18 |
| | Estate + CDS + SoB | | | | 0.32 | 0.10 | 0.29 | 2.76 | 359.66 | 0.13 |
| Ridge | Estate | 0.09 | 269.11 | 0.58 | 0.57 | 0.14 | 0.45 | 4.71 | 491.21 | 0.19 |
| | CDS | 0.07 | 193.19 | 0.43 | 0.39 | 0.11 | 0.33 | 3.23 | 438.27 | 0.17 |
| | SoB | | 82.00 | 0.37 | 0.32 | 0.10 | 0.29 | 3.01 | | |
| | CM eigs | 0.09 | 355.12 | 0.79 | 0.60 | 0.16 | 0.55 | 5.82 | 590.69 | 0.19 |
| | Bag of Bonds | 0.06 | 163.76 | 0.48 | 0.32 | 0.11 | 0.31 | 3.37 | 472.93 | 0.19 |
| | Estate + CDS + SoB | 0.06 | 77.31 | 0.39 | 0.32 | 0.10 | 0.28 | 2.78 | 383.07 | 0.13 |
| SVR | Estate | 0.09 | 207.78 | 0.60 | 0.45 | 0.13 | 0.35 | 4.41 | 476.06 | 0.17 |
| | CDS | 0.07 | 223.24 | 0.52 | 0.34 | 0.12 | 0.32 | 3.21 | 436.81 | 0.18 |
| | SoB | 0.06 | 130.78 | 0.40 | | 0.10 | 0.28 | 2.97 | 331.27 | 0.14 |
| | CM eigs | 0.08 | 288.41 | 0.55 | 0.60 | 0.15 | 0.53 | 4.54 | 584.44 | 0.21 |
| | Bag of Bonds | 0.07 | 159.24 | 0.47 | 0.35 | 0.12 | 0.28 | 3.34 | 385.59 | 0.18 |
| | Estate + CDS + SoB | 0.06 | 129.89 | 0.37 | 0.34 | 0.10 | 0.28 | | 353.18 | 0.13 |
| RF | Estate | 0.09 | 252.74 | 0.59 | 0.50 | 0.14 | 0.39 | 4.09 | 488.98 | 0.19 |
| | CDS | 0.07 | 241.67 | 0.46 | 0.36 | 0.11 | 0.29 | 3.34 | 435.77 | 0.16 |
| | SoB | 0.07 | 136.91 | 0.48 | 0.40 | 0.12 | 0.30 | 3.47 | 417.46 | 0.15 |
| | CM eigs | 0.09 | 286.89 | 0.67 | 0.62 | 0.15 | 0.51 | 5.52 | 512.22 | 0.20 |
| | Bag of Bonds | 0.07 | 172.41 | 0.46 | 0.36 | 0.10 | 0.29 | 3.10 | 418.35 | 0.16 |
| | Estate + CDS + SoB | 0.07 | 144.18 | 0.43 | 0.34 | | 0.26 | 3.11 | 401.27 | 0.15 |
| kNN | Estate | 0.08 | 236.55 | 0.61 | 0.49 | 0.15 | 0.41 | 4.30 | 563.89 | 0.20 |
| | CDS | 0.07 | 242.99 | 0.55 | 0.39 | 0.13 | 0.33 | 3.56 | 478.50 | 0.18 |
| | SoB | 0.08 | 184.43 | 0.54 | 0.44 | 0.12 | 0.36 | 3.65 | 427.20 | 0.17 |
| | CM eigs | 0.10 | 343.48 | 0.62 | 0.67 | 0.15 | 0.51 | 5.52 | 570.55 | 0.22 |
| | Bag of Bonds | 0.08 | 238.05 | 0.53 | 0.40 | 0.11 | 0.32 | 3.58 | 515.25 | 0.19 |
| | Estate + CDS + SoB | 0.08 | 171.65 | 0.54 | 0.43 | 0.12 | 0.35 | 3.57 | 442.14 | 0.17 |
| mean | n/a | 0.11 | 309.75 | 0.69 | 0.65 | 0.15 | 0.55 | 4.88 | 629.20 | 0.22 |
| dataset average | | 1.86 | 0.50 | 3.93 | 8.47 | 2.04 | 6.43 | 32.13 | 3568.65 | 1.43 |
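The evaluation protocol in the caption (20 random 80/20 train-test splits, with a take-the-mean dummy predictor as the baseline) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' pipeline: the helper names are hypothetical, the target values are synthetic, and the nested 5-fold hyperparameter search is omitted because the mean predictor has nothing to tune.

```python
import random

def shuffle_split(n_samples, n_splits=20, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]

def mean_baseline_mae(y, n_splits=20, test_fraction=0.2, seed=0):
    """Test MAE of a predictor that always returns the training-set mean,
    averaged over all splits."""
    split_maes = []
    for train, test in shuffle_split(len(y), n_splits, test_fraction, seed):
        guess = sum(y[i] for i in train) / len(train)
        split_maes.append(sum(abs(y[i] - guess) for i in test) / len(test))
    return sum(split_maes) / len(split_maes)

# Synthetic targets standing in for 109 property values (not real data)
data_rng = random.Random(1)
y = [data_rng.gauss(3.9, 0.9) for _ in range(109)]
baseline = mean_baseline_mae(y)
print(round(baseline, 2))
```

Any trained model is worth keeping only if its averaged test MAE falls clearly below this baseline, which is the role the "mean" row plays in the table.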
Mean absolute errors and Pearson correlation coefficients for ML on the dataset of Ravi et al.[34], which contains 25 nitropyrazole molecules. 5-fold cross validation was used, so Ntrain = 20 and Ntest = 5.
| model | featurization | | | |
|---|---|---|---|---|
| KRR | Estate | 0.12, 0.99 | 0.04, 0.98 | 199, 0.92 |
| | CDS | | 0.03, 0.99 | 1.10, 0.99 |
| | SoB | 0.08, 0.99 | 0.03, 0.99 | |
| Ridge | Estate | 0.32, 0.91 | 0.03, 0.98 | 2.48, 0.98 |
| | CDS | 0.14, 0.99 | | |
| | SoB | 0.44, 0.86 | 0.03, 0.99 | 2.92, 0.96 |
| mean | n/a | 1.25, 0.00 | 0.27, 0.00 | 12.90, 0.00 |
Figure 3. Residuals in leave-one-out cross validation with kernel ridge regression and the sum over bonds featurization (left), and some of the worst performing molecules (right).
The mean absolute error, Pearson correlation and average residual in different groups, for prediction of explosive energy (kJ/cc) with leave-one-out CV on the entire dataset using sum over bonds and kernel ridge regression. The groups are sorted by r value rather than MAE since the average explosive energy differs significantly between groups.
| group | Ngroup | r | R2 | MAE | avg. residual (kJ/cc) |
|---|---|---|---|---|---|
| HMX | 6 | 0.968 | 0.83 | 0.32 | −0.06 |
| Butterfly | 10 | 0.917 | 0.78 | 0.18 | −0.01 |
| TNT | 16 | 0.854 | 0.83 | 0.31 | 0.10 |
| CL20 | 6 | 0.854 | 0.83 | 0.22 | −0.09 |
| Cubane | 12 | 0.814 | 0.75 | 0.58 | −0.04 |
| Ring | 8 | 0.548 | 0.17 | 0.26 | 0.20 |
| RDX | 6 | 0.377 | 0.19 | 0.28 | −0.11 |
| Pyrazole | 20 | 0.254 | 0.21 | 0.42 | −0.07 |
| Ketone | 7 | 0.099 | −0.13 | 0.25 | 0.15 |
| Linear | 18 | 0.003 | −1.12 | 0.52 | 0.00 |
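The negative R2 values in the last rows can look surprising next to modest MAEs. The sketch below computes per-group statistics on made-up numbers to show how this arises. It assumes R2 is the usual coefficient of determination and that the residual is predicted minus true; the sign convention, the function name, and the data are assumptions for illustration.

```python
from collections import defaultdict

def group_summary(groups, y_true, y_pred):
    """Per-group size, R2, MAE, and average residual (predicted - true)."""
    by_group = defaultdict(list)
    for g, t, p in zip(groups, y_true, y_pred):
        by_group[g].append((t, p))
    summary = {}
    for g, pairs in by_group.items():
        n = len(pairs)
        mean_t = sum(t for t, _ in pairs) / n
        ss_res = sum((t - p) ** 2 for t, p in pairs)
        ss_tot = sum((t - mean_t) ** 2 for t, _ in pairs)
        summary[g] = {
            "n": n,
            "R2": 1.0 - ss_res / ss_tot,
            "MAE": sum(abs(t - p) for t, p in pairs) / n,
            "avg_residual": sum(p - t for t, p in pairs) / n,
        }
    return summary

# Hypothetical groups: "wide" spans a broad range of energies, "narrow" does not
groups = ["wide", "wide", "wide", "narrow", "narrow", "narrow"]
y_true = [3.0, 4.0, 5.0, 4.0, 4.1, 4.2]
y_pred = [3.1, 4.0, 4.9, 4.4, 3.8, 4.2]
stats = group_summary(groups, y_true, y_pred)
for name, row in stats.items():
    print(name, row)
```

When the true values within a group barely vary, even small errors exceed the spread of the data, so R2 drops below zero although the MAE stays moderate.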
Figure 4. The learning curves for predicting detonation velocity (left) and detonation pressure (right) for the combined (N = 418) dataset plotted on a log-log plot. Shaded regions show the standard deviation of the error in 5-fold cross validation.
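A learning curve that is roughly straight on a log-log plot suggests a power law, error ≈ a·N^b, whose exponent b can be estimated with a least-squares line in log space. The sketch below shows the fit on synthetic points; the function name is hypothetical, and the prefactor 2.0 and exponent −0.25 are invented for the example, not values reported for this dataset.

```python
import math

def fit_power_law(n_train, errors):
    """Fit error = a * N**b by linear least squares on (log N, log error)."""
    xs = [math.log(n) for n in n_train]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic learning curve: error = 2.0 * N ** -0.25 (assumed, not measured)
sizes = [25, 50, 100, 200, 400]
errs = [2.0 * n ** -0.25 for n in sizes]
a, b = fit_power_law(sizes, errs)
print(a, b)
```

With an exponent of −0.25, halving the error would require 16 times as many training molecules, which illustrates what slow convergence with dataset size means in practice.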