Amit Kumar Halder1,2, Reza Haghbakhsh3,4, Iuliia V Voroshylova1, Ana Rita C Duarte4, Maria Natalia D S Cordeiro1.
Abstract
Deep eutectic solvents (DESs) are an important class of green solvents that have been developed as an alternative to toxic solvents. However, the large-scale industrial application of DESs requires fine-tuning their physicochemical properties. Surface tension is one such property that must be considered when designing novel DESs. In this work, we present the results of a detailed evaluation of Quantitative Structure-Property Relationships (QSPR) modeling efforts designed to predict the surface tension of DESs, following the Organization for Economic Co-operation and Development (OECD) guidelines. The data set comprises a large number of structurally diverse binary DESs, and the models were built systematically through rigorous validation methods, including 'mixtures-out'- and 'compounds-out'-based data splitting. The most predictive individual QSPR model found is shown to be statistically robust, and it also provides valuable information about the structural and physicochemical features responsible for the surface tension of DESs. Furthermore, the intelligent consensus prediction strategy applied to multiple predictive models led to consensus models with statistical robustness similar to that of the individual QSPR model. The present work is also reproducible, since it relies on fully specified computational procedures and on publicly available tools. Finally, our results not only guide the future design and screening of novel DESs with a desirable surface tension but also lay out strategies for efficiently setting up in silico-based models for binary mixtures.
Keywords: DES; QSPR; consensus modeling; in silico-based models; surface tension; validation
Year: 2022 PMID: 35956845 PMCID: PMC9370217 DOI: 10.3390/molecules27154896
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.927
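The 'mixtures-out' data splitting named in the abstract can be illustrated with a short sketch. It assumes that 'mixtures-out' means every data point of a given mixture (for example, the same pair of constituents measured at several temperatures) is kept in the same subset, so that no mixture is seen in both training and test sets. The `mixture_id` key, the `test_fraction` parameter, and the function name are hypothetical and are not taken from the original work.

```python
import random
from collections import defaultdict


def mixtures_out_split(records, test_fraction=0.2, seed=0):
    """Split records so that all data points of a mixture share one subset.

    `records` is a list of dicts carrying a hypothetical "mixture_id" key.
    `test_fraction` is applied to the number of mixtures, not of data points.
    """
    # Group the data points by the mixture they belong to.
    groups = defaultdict(list)
    for rec in records:
        groups[rec["mixture_id"]].append(rec)

    # Shuffle the mixture identifiers reproducibly.
    mixture_ids = sorted(groups)
    random.Random(seed).shuffle(mixture_ids)

    # Reserve whole mixtures for the test set.
    n_test = max(1, round(test_fraction * len(mixture_ids)))
    test_ids = set(mixture_ids[:n_test])

    train = [r for m in mixture_ids[n_test:] for r in groups[m]]
    test = [r for m in mixture_ids[:n_test] for r in groups[m]]
    return train, test, test_ids


if __name__ == "__main__":
    # Made-up data: 10 mixtures with 5 temperature points each.
    data = [{"mixture_id": i % 10, "T": 290 + i} for i in range(50)]
    train, test, test_ids = mixtures_out_split(data, test_fraction=0.2, seed=1)
    print(len(train), len(test), sorted(test_ids))
```

A 'compounds-out' split could presumably be sketched the same way by grouping on a constituent identifier instead of a mixture identifier, although a mixture then belongs to two groups at once and needs extra handling.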
Figure 1. Basic workflow chart for the QSPR regression modeling followed in this study.
Table 1. Statistical results of the top 15 unique QSPR regression models generated.
| Model | Seed; Interval | Descriptor a | Split b | Scoring c | Ntr d | Q2LOO e | Q2LCO f | MAELOO g | Nts h | R2Pred i | MAEtest j | Avg k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M01 | 2; 5 | Method-1 | MO | NMAE | 408 | 0.884 | 0.854 | 2.586 | 127 | 0.907 | 3.863 | 0.882 |
| M02 | 4; 5 | Method-2 | MO | NMAE | 435 | 0.873 | 0.849 | 2.658 | 100 | 0.899 | 3.966 | 0.874 |
| M03 | 1; 4 | Method-2 | MO | NMAE | 408 | 0.898 | 0.854 | 2.775 | 127 | 0.865 | 2.569 | 0.872 |
| M04 | 5; 4 | Method-2 | MO | NMAE | 409 | 0.898 | 0.855 | 2.771 | 126 | 0.862 | 2.584 | 0.872 |
| M05 | 4; 5 | Method-1 | MO | NMAE | 435 | 0.871 | 0.839 | 2.635 | 100 | 0.898 | 4.039 | 0.869 |
| M06 | 3; 5 | Method-1 | MO | NMAE | 443 | 0.881 | 0.849 | 2.671 | 92 | 0.871 | 4.073 | 0.867 |
| M07 | 1; 3 | Method-1 | MO | NMAE | 359 | 0.901 | 0.862 | 2.369 | 176 | 0.836 | 4.223 | 0.866 |
| M08 | 4; 5 | Method-1 | MO | R2 | 435 | 0.883 | 0.858 | 2.854 | 100 | 0.855 | 4.695 | 0.865 |
| M09 | 4; 3 | Method-1 | MO | NMAE | 360 | 0.906 | 0.854 | 2.322 | 175 | 0.83 | 4.389 | 0.864 |
| M10 | 1; 2 | Method-2 | CO | R2 | 301 | 0.931 | 0.903 | 1.660 | 234 | 0.754 | 6.030 | 0.862 |
| M11 | 4; 5 | Method-2 | MO | 5-fold | 435 | 0.865 | 0.841 | 3.000 | 100 | 0.876 | 4.282 | 0.861 |
| M12 | 4; 3 | Method-2 | MO | R2 | 360 | 0.908 | 0.882 | 2.608 | 175 | 0.783 | 5.134 | 0.858 |
| M13 | 4; 5 | Method-1 | MO | 5-fold | 435 | 0.871 | 0.845 | 2.706 | 100 | 0.857 | 4.626 | 0.858 |
| M14 | 1; 4 | Method-1 | MO | 10-fold | 408 | 0.869 | 0.849 | 3.340 | 127 | 0.847 | 2.050 | 0.855 |
| M15 | 5; 4 | Method-1 | MO | 10-fold | 409 | 0.869 | 0.85 | 3.336 | 126 | 0.844 | 2.052 | 0.855 |
a Descriptor calculation method used. b Data splitting scheme utilized. c Scoring condition applied. d Number of data points in the training set. e Leave-one-out cross-validation determination coefficient. f Leave-chemical-out cross-validation determination coefficient. g LOO cross-validation mean absolute error. h Number of data points in the test set. i Variance explained for external prediction. j Mean absolute error of the test set. k Average value of Q2LOO, Q2LCO and R2Pred.
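The Avg column can be recomputed directly from the tabulated values, since the footnote defines it as the average of Q2LOO, Q2LCO and R2Pred. The sketch below does this for three rows of the table; the function name is illustrative only. Because the tabulated inputs are already rounded to three decimals, a recomputed average may differ from the published one in the last digit for some rows.

```python
def average_score(q2_loo, q2_lco, r2_pred):
    """Arithmetic mean of the three validation coefficients."""
    return (q2_loo + q2_lco + r2_pred) / 3


# (Q2LOO, Q2LCO, R2Pred) copied from the table above.
rows = {
    "M01": (0.884, 0.854, 0.907),
    "M02": (0.873, 0.849, 0.899),
    "M12": (0.908, 0.882, 0.783),
}

for model, scores in rows.items():
    print(model, round(average_score(*scores), 3))
```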
Table 2. Internal and external predictivity for the top 15 regression models against the training, test and external validation sets a.
| Model b | Training Set | | | | Test Set | | | External Validation Set | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Ntr | Q2LOO | Q2LCO | %AARD | Nts | R2Pred | %AARD | Nex c | R2Pred | %AARD |
| M01 | 408 | 0.884 | 0.854 | 5.541 | 127 | 0.907 | 11.843 | 84 | −0.335 | 15.931 |
| M02 | 435 | 0.873 | 0.849 | 6.11 | 100 | 0.899 | 7.517 | 84 | 0.464 | 12.176 |
| M03 | 408 | 0.898 | 0.854 | 6.063 | 127 | 0.865 | 5.418 | 84 | −0.196 | 15.833 |
| M04 | 409 | 0.898 | 0.855 | 6.057 | 126 | 0.862 | 5.446 | 84 | −0.19 | 15.818 |
| M05 | 435 | 0.871 | 0.839 | 5.965 | 100 | 0.898 | 7.538 | 84 | 0.392 | 11.838 |
| M06 | 443 | 0.881 | 0.849 | 6.052 | 92 | 0.871 | 7.65 | 84 | 0.516 | 11.021 |
| M07 | 359 | 0.901 | 0.862 | 5.222 | 176 | 0.836 | 9.331 | 84 | −7.225 | 27.7 |
| M08 | 435 | 0.883 | 0.858 | 6.6 | 100 | 0.855 | 8.456 | 84 | 0.466 | 11.45 |
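| **M09** | **360** | **0.906** | **0.854** | | **175** | **0.83** | | **84** | | |
| **M10** | **301** | **0.931** | **0.903** | | **234** | **0.754** | | **84** | | |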
| M11 | 435 | 0.865 | 0.841 | 6.875 | 100 | 0.876 | 7.78 | 84 | 0.568 | 10.047 |
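| **M12** | **360** | **0.908** | **0.882** | **5.805** | **175** | **0.783** | **11.155** | **84** | **0.862** | **4.418** |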
| M13 | 435 | 0.871 | 0.845 | 6.204 | 100 | 0.857 | 8.442 | 84 | 0.544 | 11.093 |
| M14 | 408 | 0.869 | 0.849 | 7.344 | 127 | 0.847 | 3.943 | 84 | 0.352 | 11.642 |
| M15 | 409 | 0.869 | 0.85 | 7.339 | 126 | 0.844 | 3.929 | 84 | 0.353 | 11.638 |
a For the meaning of Ntr, Nts, Q2LOO, Q2LCO, R2Pred and %AARD, please check the footnotes of Table 1. b The more predictive models are marked in bold. c Number of data points in the external validation set.
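The negative R2Pred values in the external validation columns are easier to read with the metric definitions at hand. The sketch below uses the commonly used forms, which are assumed here because the formulas themselves do not appear in this excerpt: %AARD as the mean of |predicted − observed| / observed times 100, and R2Pred as one minus the ratio of the squared prediction errors to the squared deviations of the observations from the training-set mean. All numbers in the example are made up for illustration.

```python
def aard_percent(observed, predicted):
    """Absolute average relative deviation, in percent (assumed definition)."""
    total = sum(abs((p - o) / o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


def r2_pred(observed, predicted, train_mean):
    """External predictive R2 relative to the training-set mean (assumed definition)."""
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - train_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    observed = [30.0, 40.0, 50.0]   # made-up surface tensions
    close = [31.0, 39.0, 52.0]      # predictions near the observations
    biased = [50.0, 60.0, 70.0]     # predictions with a constant offset
    train_mean = 40.0               # made-up training-set mean

    print(aard_percent(observed, close), r2_pred(observed, close, train_mean))
    print(aard_percent(observed, biased), r2_pred(observed, biased, train_mean))
```

Under this definition R2Pred drops below zero whenever the model predicts the set worse than simply using the training-set mean, which is how a strongly negative value such as the one listed for M07 can arise.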
The five WM molecular descriptors selected for model M12 (Equation (3)).
| Symbol | Definition [ | Class |
|---|---|---|
| P_VSA_MR_6pmix | P_VSA-like on Molar Refractivity, at bin size 6 | P_VSA-like descriptor |
| Eig02_EA(dm)pmix | eigenvalue n. 2 from edge adjacency matrix, weighted by dipole moment | Edge adjacency indices ( |
| CATS2D_02_ANpmix | CATS2D Acceptor-Negative at lag 2 | 2D CATS |
| BLTF96pmix | Verhaar Fish base-line toxicity from MLOGP (mmol/L) | Molecular properties ( |
| MATS5snmix | Moran autocorrelation of lag 5, weighted by I-state | 2D autocorrelations ( |
P_VSA-like descriptors stand for the van der Waals surface area (VSA) with a particular property (P), in this case, the molar refractivity (MR) [57]. Chemically Advanced Template Search (CATS) descriptors are expressly designed to identify scaffold hops [58]. I-states are based on the Kier-Hall atomic electronegativity modified by the number of σ bonds, number of hydrogen atoms, number of electrons in π orbitals, and number of lone pair electrons [55,56].
MLR statistical results for model M12 (Equation (3)).
| Training Set | Test Set | External Set |
|---|---|---|
| | MAE = 5.134; | MAE = 1.777; |
| %AARD = 5.805; | %AARD = 11.155 | %AARD = 4.418 |
R2: Determination coefficient; R2Adj: Adjusted R2; F(6,353): Fisher’s statistic; MAELOO and MAELCO: Leave-one-out and leave-chemicals-out cross-validation mean absolute error, respectively; r2(LOO) and ∆r2(LOO): LOO cross-validation r2 and its associated deviation, respectively; r2(test) and ∆r2(test): r2 of the test set and its associated deviation, respectively; r2(ext) and ∆r2(ext): r2 of the external test set and its associated deviation, respectively. For the meaning of Ntr, Nts, Nex, Q2LOO, Q2LCO, R2Pred, and %AARD, check the footnotes of Table 1 and Table 2.
Figure 2. Relative deviations (%RD) between the predicted and observed DES surface tensions (left) and histogram plot of the distribution of %RD values (right).
Figure 3. Predicted surface tension values vs. observed experimental ones (left) and Williams plot (right) obtained for model M12.
Figure 4. Relative importance of the descriptors found in the best individual model M12.
Figure 5. Comparison of the surface tension calculated by the M12 model with literature data over the temperature range 278.15 to 358.15 K for six DESs at atmospheric pressure. DES1: DL-menthol and octanoic acid (3:1); DES2: tetrabutylammonium chloride and arginine (8:1); DES3: tetrapropylammonium bromide and ethylene glycol (1:6); DES4: N,N-diethylethanolammonium chloride and glycerol (1:5); DES5: choline chloride and glycerol (1:5); DES6: choline chloride and D-glucose (1:1).
Summary of the statistical parameters obtained from non-linear models based on different machine learning methods.
| Method a | Training Set ( | Test Set ( | External Set ( |
|---|---|---|---|
| k-NN | 0.176 | 0.597 | not determined |
| RF | 0.473 | 0.746 | not determined |
| SVM | 0.874 | 0.774 | 0.767 |
| MLP | 0.541 | 0.269 | not determined |
| GB | 0.453 | 0.471 | not determined |
a k-NN: k-Nearest Neighbors; RF: Random Forests; SVM: Support Vector Machines; NN-MLP: Neural Network Multilayer Perceptron; GB: Gradient boosting.
External predictivity of the best individual model M12 and consensus models (C1-C4) built with different combinations of the top three models (M09, M10 and M12).
| Consensus Models | Models | CM a | R2Pred b | r2 (test) c | MAEtest d | %AARD e |
|---|---|---|---|---|---|---|
| C1 | M09, M10, M12 | 0 | 0.864 | 0.801 | 1.869 | 4.459 |
| C2 | M10 and M12 | 2 | 0.823 | 0.812 | 2.089 | 4.732 |
| C3 | M09 and M10 | 2 | 0.853 | 0.787 | 1.979 | 4.393 |
| C4 | M09 and M12 | None | ----- | ----- | ----- | ----- |
| M12 | ----- | ----- | 0.862 | 0.767 | 1.777 | 4.418 |
a Method of intelligent consensus prediction that yielded the best external validation result. b Variance explained for the external prediction. c Metric r2 for the test set. d Mean absolute error for the test set. e Absolute average relative deviation.
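The general idea of combining several individual models into a consensus prediction can be sketched as follows. This is not the intelligent consensus prediction procedure used to build C1 to C4, whose CM variants are not defined in this excerpt; it only shows two generic schemes, a plain average and a weighted average. The choice of inverse cross-validated MAE as the weight, the model names, and all numbers are assumptions made for illustration.

```python
def simple_consensus(predictions):
    """Average the predictions of all models for each data point.

    `predictions` maps a model name to its list of predicted values;
    all lists must have the same length and ordering.
    """
    n_models = len(predictions)
    return [sum(vals) / n_models for vals in zip(*predictions.values())]


def weighted_consensus(predictions, cv_mae):
    """Weighted average using 1 / MAE as the weight (an assumed choice)."""
    weights = {name: 1.0 / cv_mae[name] for name in predictions}
    total = sum(weights.values())
    n_points = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in predictions) / total
        for i in range(n_points)
    ]


if __name__ == "__main__":
    # Made-up predictions of two hypothetical models for two mixtures.
    preds = {"model_A": [40.0, 50.0], "model_B": [44.0, 54.0]}
    print(simple_consensus(preds))
    print(weighted_consensus(preds, {"model_A": 1.0, "model_B": 3.0}))
```

With equal weights the two schemes coincide; with unequal weights the consensus moves toward the model with the lower cross-validation error.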