Florbela Pereira, João Aires-de-Sousa.
Abstract
Machine learning (ML) algorithms were explored for the fast estimation of molecular dipole moments calculated by density functional theory (DFT) at the B3LYP/6-31G(d,p) level, on the basis of molecular descriptors generated from DFT-optimized geometries and partial atomic charges obtained by empirical or ML schemes. A database of 10,071 structures was used, new molecular descriptors were designed, and the models were validated with external test sets. Several ML algorithms were screened. Random forest regression models predicted an external test set of 3368 compounds with a mean absolute error as low as 0.44 D. The results represent a significant improvement over the dipole moments calculated using empirical point charges located at the nuclei, even assuming the DFT-optimized geometry (root mean square error, RMSE, of 0.68 D vs. 1.53 D and R² = 0.87 vs. 0.66).
Keywords: Density functional theory (DFT); Machine learning (ML); Molecular dipole moment; Partial atomic charges; Quantitative structure property relationships (QSPR)
Year: 2018 PMID: 30136001 PMCID: PMC6104469 DOI: 10.1186/s13321-018-0296-5
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Comparison of DFT DMs with DMNBO and DMPEOE for the 6703 and 3368 chemical structures of the training and test sets, respectively
| Set | Charge^a | MAE (D) | R²/RMSE (D) |
|---|---|---|---|
| Tr^b | NBO | 1.44 | 0.626/1.80 |
| Tr^b | PEOE | 0.988 | 0.650/1.53 |
| Te^c | NBO | 1.44 | 0.656/1.78 |
| Te^c | PEOE | 0.968 | 0.659/1.53 |

^a The DMs were calculated using the DFT-optimized geometry
^b Training data set
^c Test data set
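
The DMNBO and DMPEOE values compared above are dipole moments obtained from partial atomic charges placed at the nuclei. A minimal sketch of that point-charge calculation follows, assuming charges in units of e and coordinates in ångström; the function name and the toy two-atom input are hypothetical and are not taken from the paper.

```python
import math

# Unit conversion: 1 e·Å is approximately 4.8032 debye.
E_ANGSTROM_TO_DEBYE = 4.8032


def point_charge_dipole(charges, coords):
    """Return the dipole moment magnitude (in debye) of a set of point charges.

    charges: partial atomic charges in units of e
    coords:  (x, y, z) nuclear positions in angstrom, same order as charges
    """
    mu = [0.0, 0.0, 0.0]
    for q, (x, y, z) in zip(charges, coords):
        # Each atom contributes q * r to the dipole vector.
        mu[0] += q * x
        mu[1] += q * y
        mu[2] += q * z
    return E_ANGSTROM_TO_DEBYE * math.sqrt(sum(c * c for c in mu))


# Hypothetical diatomic: +0.2 e and -0.2 e separated by 1.0 angstrom.
dm_diatomic = point_charge_dipole([0.2, -0.2], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])

# Hypothetical symmetric linear triatomic: the bond dipoles cancel.
dm_linear = point_charge_dipole(
    [-0.3, 0.6, -0.3],
    [(-1.2, 0.0, 0.0), (0.0, 0.0, 0.0), (1.2, 0.0, 0.0)],
)
```

For a neutral molecule the result does not depend on the choice of origin, so the coordinates can be used as they come out of the geometry optimization.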
Prediction of the DFT DM by random forests on the basis of different molecular descriptors
| Descriptors (#) | Training set^a MAE (D) | Training set^a R²/RMSE (D) | Test set MAE (D) | Test set R²/RMSE (D) |
|---|---|---|---|---|
| RDF_N^b (384) | 0.944 | 0.480/1.332 | 0.947 | 0.498/1.344 |
| RDF_P^c (384) | 0.890 | 0.512/1.295 | 0.882 | 0.549/1.287 |
| PchmDM_N^b (360) | 0.924 | 0.545/1.267 | 0.880 | 0.589/1.250 |
| PchmDM_P^c (360) | 0.873 | 0.569/1.240 | 0.931 | 0.566/1.278 |
| CDK^d (47) | 0.983 | 0.434/1.385 | 0.985 | 0.445/1.402 |
| MACCS FP^e (166) | 0.790 | 0.579/1.195 | 0.775 | 0.609/1.182 |
| PubChem FP^e (881) | 0.817 | 0.547/1.238 | 0.801 | 0.584/1.217 |
| CDK FP^e (1024) | 0.880 | 0.501/1.301 | 0.874 | 0.521/1.305 |

^a OOB estimation
^b Descriptors calculated using NBO charges
^c Descriptors calculated using PEOE charges
^d Geometric CDK descriptors
^e Fingerprints
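
The tables report three error measures for each model: MAE, RMSE, and R². The sketch below shows how such values can be computed from paired reference and predicted dipole moments. The record does not state whether R² is the coefficient of determination or the squared Pearson correlation, so both are given; the function names and the four-point toy data are illustrative assumptions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_determination(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def r2_pearson(y_true, y_pred):
    """Squared Pearson correlation between reference and predicted values."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


# Toy data (in debye), not taken from the paper.
dft_dm = [1.0, 2.0, 3.0, 4.0]
predicted_dm = [1.5, 2.0, 2.5, 4.0]

toy_mae = mae(dft_dm, predicted_dm)
toy_rmse = rmse(dft_dm, predicted_dm)
toy_r2 = r2_determination(dft_dm, predicted_dm)
toy_r2_pearson = r2_pearson(dft_dm, predicted_dm)
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two columns can rank descriptor sets slightly differently.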
Prediction of the DFT DM by random forests using DMNBO or DMPEOE
| Descriptors | Training set^a MAE (D) | Training set^a R²/RMSE (D) | Test set MAE (D) | Test set R²/RMSE (D) |
|---|---|---|---|---|
| RDF_N + DMNBO^b | 0.639 | 0.747/0.930 | 0.638 | 0.761/0.929 |
| RDF_P + DMPEOE^c | 0.624 | 0.740/0.946 | 0.615 | 0.765/0.929 |
| PchmDM_N + DMNBO^b | 0.647 | 0.753/0.924 | 0.651 | 0.769/0.921 |
| PchmDM_P + DMPEOE^c | 0.639 | 0.735/0.953 | 0.630 | 0.759/0.936 |
| CDK + DMNBO^d | 0.713 | 0.705/1.00 | 0.700 | 0.724/0.990 |
| CDK + DMPEOE^e | 0.708 | 0.685/1.03 | 0.704 | 0.705/1.02 |
| MACCS + DMNBO | 0.526 | 0.806/0.813 | 0.507 | 0.826/0.792 |
| MACCS + DMPEOE | 0.563 | 0.777/0.873 | 0.543 | 0.801/0.847 |

^a OOB estimation
^b Descriptors calculated using NBO charges, and DMNBO
^c Descriptors calculated using PEOE charges, and DMPEOE
^d Geometric CDK, and DMNBO
^e Geometric CDK, and DMPEOE
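
The training-set columns are out-of-bag (OOB) estimates: each training structure is predicted only by the ensemble members whose bootstrap sample did not contain it. The sketch below illustrates the idea with a deliberately simplified stand-in, a bagged ensemble of one-split regression stumps on a single descriptor, rather than a full random forest. All names and the synthetic step-function data are assumptions made for illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump on one descriptor; return a predictor."""
    mean_all = sum(ys) / len(ys)
    best = None  # (sse, threshold, left_mean, right_mean)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:  # only one distinct x value: fall back to the mean
        return lambda x: mean_all
    _, thr, ml, mr = best
    return lambda x: ml if x <= thr else mr


def oob_predictions(xs, ys, n_models=200, seed=0):
    """Average, for each sample, the predictions of models that never saw it."""
    rng = random.Random(seed)
    n = len(xs)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_models):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(bag)
        model = fit_stump([xs[i] for i in bag], [ys[i] for i in bag])
        for i in range(n):
            if i not in in_bag:  # sample i is out-of-bag for this model
                sums[i] += model(xs[i])
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Synthetic data: a step function of one descriptor (not from the paper).
xs = [i / 10.0 for i in range(40)]
ys = [0.0 if x < 2.0 else 3.0 for x in xs]

oob = oob_predictions(xs, ys)
scored = [(p, t) for p, t in zip(oob, ys) if p is not None]
oob_mae = sum(abs(p - t) for p, t in scored) / len(scored)
```

Because every OOB prediction comes from models that did not train on that sample, the OOB error behaves like an internal validation estimate, which is consistent with the training-set and test-set columns being close in the tables.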
Prediction of the DFT DM by random forests using NBO charges and combining different descriptors
| Models | Training set^a MAE (D) | Training set^a R²/RMSE (D) | Test set MAE (D) | Test set R²/RMSE (D) |
|---|---|---|---|---|
| A^b | 0.627 | 0.757/0.912 | 0.623 | 0.774/0.905 |
| B^c | 0.616 | 0.762/0.903 | 0.611 | 0.778/0.896 |
| C^d | 0.525 | 0.823/0.780 | 0.512 | 0.846/0.752 |
| D^e | 0.522 | 0.824/0.777 | 0.509 | 0.846/0.750 |
| E^f | 0.562 | 0.790/0.850 | 0.553 | 0.807/0.838 |
| F^g | 0.497 | 0.837/0.749 | 0.479 | 0.860/0.719 |

^a OOB estimation
^b RDF pairs NBO charges, PchmDM NBO charges, and DMNBO
^c RDF pairs NBO charges, PchmDM NBO charges, geometric CDK, and DMNBO
^d RDF pairs NBO charges, PchmDM NBO charges, DMPEOE, and DMNBO
^e RDF pairs NBO charges, PchmDM NBO charges, geometric CDK, DMPEOE, and DMNBO
^f RDF pairs NBO charges, PchmDM NBO charges, MACCS fingerprints, and DMNBO
^g RDF pairs NBO charges, PchmDM NBO charges, MACCS fingerprints, DMPEOE, and DMNBO
Prediction of the DFT DM by random forests using a combination of DMNBO and DMPEOE with different types of descriptors
| Models | Training set^a MAE (D) | Training set^a R²/RMSE (D) | Test set MAE (D) | Test set R²/RMSE (D) |
|---|---|---|---|---|
| MACCS^b | 0.460 | 0.853/0.707 | 0.444 | 0.872/0.680 |
| RDF^b,c | 0.533 | 0.817/0.791 | 0.522 | 0.839/0.767 |
| PchmDM^b,d | 0.540 | 0.819/0.786 | 0.533 | 0.839/0.765 |
| CDK^b | 0.538 | 0.815/0.793 | 0.525 | 0.837/0.765 |

^a OOB estimation
^b Combining DMPEOE and DMNBO
^c RDF pairs NBO charges
^d PchmDM NBO charges
RF prediction of DM with subsets of descriptors from models C and F
| Model/no. of descriptors | MAE (D) | R²/RMSE (D) |
|---|---|---|
| Training set | | |
| C/75^a | 0.515 | 0.826/0.769 |
| C/100^a | 0.517 | 0.826/0.771 |
| C/125^a | 0.518 | 0.826/0.7709 |
| C/32^b | 0.521 | 0.824/0.773 |
| C/39^c | 0.523 | 0.824/0.775 |
| C/296^d | 0.529 | 0.821/0.782 |
| F/75^a | 0.482 | 0.844/0.731 |
| F/100^a | 0.482 | 0.844/0.731 |
| F/125^a | 0.483 | 0.844/0.732 |
| F/34^b | 0.501 | 0.838/0.745 |
| F/41^c | 0.499 | 0.838/0.743 |
| F/297^d | 0.503 | 0.832/0.755 |
| Test set | | |
| C/75^a | 0.502 | 0.847/0.744 |
| C/100^a | 0.506 | 0.846/0.748 |
| C/125^a | 0.505 | 0.845/0.748 |
| C/32^b | 0.509 | 0.845/0.747 |
| C/39^c | 0.512 | 0.844/0.749 |
| C/296^d | 0.519 | 0.843/0.758 |
| F/75^a | 0.468 | 0.864/0.704 |
| F/100^a | 0.466 | 0.865/0.702 |
| F/125^a | 0.466 | 0.867/0.699 |
| F/34^b | 0.481 | 0.860/0.713 |
| F/41^c | 0.479 | 0.860/0.710 |
| F/297^d | 0.485 | 0.852/0.725 |

^a OOB estimation for the training set
^b Using the mean decrease in accuracy measure of importance for the descriptors in the RF algorithm
^c Using the CFS with BestFirst routine from Weka
^d Using the CFS with GreedyStepwise routine from Weka
^e Using the CFS with PSOSearch routine from Weka
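
One of the footnotes above refers to the mean decrease in accuracy as the measure of descriptor importance. The idea can be sketched in a model-agnostic way: permute one descriptor column, re-score the model, and record how much the error grows. This is a simplified illustration under stated assumptions, not the per-tree out-of-bag procedure of an actual random forest implementation; the toy model, data, and function names are hypothetical.

```python
import random


def mse(model, X, y):
    """Mean squared error of a predictor on rows X with targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in MSE after shuffling each descriptor column."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between descriptor j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy setup: the target depends only on the first descriptor.
X = [[float(i), float((i * 7) % 10)] for i in range(10)]
y = [2.0 * row[0] for row in X]


def toy_model(row):
    """Stand-in for a trained model; it ignores the second descriptor."""
    return 2.0 * row[0]


importances = permutation_importance(toy_model, X, y)

# Rank descriptor indices from most to least important and keep the top one,
# mirroring the selection of the N most important descriptors.
ranked = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
top_descriptors = ranked[:1]
```

A descriptor that the model never uses gets an importance of zero, so truncating the ranked list removes such descriptors first.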
Fig. 1. Predicted versus DFT-calculated DM for the 3368 molecular structures of the external test set. (a) Predictions obtained with the RF C model trained with 75 descriptors; (b) predictions obtained with the RF F model trained with 75 descriptors
Exploration of different ML algorithms in the prediction of DFT DM using the 75 most important descriptors obtained for model C
| ML | MAE (D) | R²/RMSE (D) |
|---|---|---|
| Training set | | |
| SVM | 0.531 | 0.819/0.783 |
| MLP | 0.550 | 0.813/0.797 |
| RBF | 0.561 | 0.811/0.801 |
| Test set | | |
| SVM | 0.526 | 0.840/0.755 |
| MLP | 0.531 | 0.836/0.763 |
| RBF | 0.538 | 0.839/0.757 |
Predictions for the training set were obtained with ten-fold cross-validation experiments
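
Ten-fold cross-validation, as mentioned in the note above, splits the training set into ten parts and predicts each part with a model fitted on the other nine. A minimal sketch follows, assuming a shuffled split and a trivial mean-value predictor as a placeholder for the actual learner; all function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_val_predict(fit, X, y, k=10, seed=0):
    """Predict every sample with a model that was not trained on it."""
    preds = [None] * len(y)
    for train, test in k_fold_indices(len(y), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = model(X[i])
    return preds


def fit_mean(X_train, y_train):
    """Placeholder learner: always predicts the training-set mean."""
    mean_y = sum(y_train) / len(y_train)
    return lambda row: mean_y


# Fold sizes for a training set of 6703 structures.
fold_sizes = sorted(len(test) for _, test in k_fold_indices(6703))

# Toy data, not from the paper.
X_toy = [[float(i)] for i in range(20)]
y_toy = [float(i) for i in range(20)]
cv_preds = cross_val_predict(fit_mean, X_toy, y_toy)
```

The cross-validated predictions can then be scored with the same MAE, RMSE, and R² measures as the external test set.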
Random forest prediction of the DFT DM for the external test sets I and II
| Models/test sets | MAE (D) | R²/RMSE (D) |
|---|---|---|
| C/I^a | 0.559 | 0.598/0.782 |
| F/I^a | 0.462 | 0.690/0.656 |
| MACCS_DMNBO_DMPEOE/I^a | 0.370 | 0.752/0.573 |
| C/II^b | 1.362 | 0.938/1.63 |
| F/II^b | 1.344 | 0.944/1.600 |
| MACCS_DMNBO_DMPEOE/II^b | 1.292 | 0.927/1.545 |

^a Test set I comprising 200 molecules calculated at the B3LYP/6-31G(d,p) level
^b Test set II comprising 16 molecules calculated at the B3LYP/6-31G(d) level
Fig. 2. DM predicted by the MACCS_DMNBO_DMPEOE model versus DFT-calculated DM [2] for the 200 molecular structures of test set I
Fig. 3. DM predicted by the MACCS_DMNBO_DMPEOE model versus DFT-calculated DM [8] for the 16 molecular structures of test set II
Fig. 4. Illustration of predicted DM values for two molecules of test set II, obtained by the best RF MACCS_DMNBO_DMPEOE model, DFT calculations [8], and the ChemAxon CXCALC tool