Literature DB >> 35012054

Atomistic Descriptors for Machine Learning Models of Solubility Parameters for Small Molecules and Polymers.

Mingzhe Chi¹, Rihab Gargouri², Tim Schrader¹, Kamel Damak², Ramzi Maâlej², Marek Sierka¹.

Abstract

Descriptors derived from atomic structure and quantum chemical calculations for small molecules representing polymer repeat elements were evaluated for machine learning models to predict the Hildebrand solubility parameters of the corresponding polymers. Since reliable cohesive energy density data and solubility parameters for polymers are difficult to obtain, the experimental heat of vaporization ΔHvap of a set of small molecules was used as a proxy property to evaluate the descriptors. Using the atomistic descriptors, the multilinear regression model showed good accuracy in predicting ΔHvap of the small-molecule set, with a mean absolute error of 2.63 kJ/mol for training and 3.61 kJ/mol for cross-validation. Kernel ridge regression showed similar performance for the small-molecule training set but slightly worse accuracy for the prediction of ΔHvap of molecules representing repeating polymer elements. The Hildebrand solubility parameters of the polymers derived from the atomistic descriptors of the repeating polymer elements showed good correlation with values from the CROW polymer database.

Entities: Chemical

Keywords: machine learning; polymer; properties prediction

Year: 2021 PMID： 35012054 PMCID： PMC8747575 DOI： 10.3390/polym14010026

Source DB: PubMed Journal: Polymers (Basel) ISSN： 2073-4360 Impact factor: 4.329

1. Introduction

Computer-aided predictions of polymer solubility and miscibility with small molecules and drugs are of fundamental importance in a number of industrial applications, including the use of polymers as drug carriers in the growing field of nanomedicine [1]. Among various approaches, solubility and miscibility predictions based on Hildebrand solubility parameters are often used for polymer blends, polymer solutions and polymer–drug mixtures [2]. The Hildebrand model uses a solubility parameter (SP), , defined as the square root of the cohesive energy density: where is the cohesive energy, and is the molar volume. The miscibility of two substances can be estimated by comparing the absolute value of the difference in their SPs. If it is more than 2 MPa1/2, the two substances are deemed immiscible, and with a difference of less than 2 MPa1/2, they are considered miscible [3]. The factor 2 MPa1/2 was determined on the basis of empirical considerations [3]. The Hildebrand SP can also be used to roughly estimate the Flory–Huggins interaction parameter [4], which is another useful tool for predicting the miscibility of polymer blends [5]. For low-molecular-weight compounds, and can be estimated from the heat of vaporization: where is the heat of vaporization [2]. However, for polymers, SP is difficult to obtain from experiments [6]. Various experimental methods can be used to indirectly derive SP, such as hot-stage microscopy, differential scanning calorimetry (DSC) and ultraviolet spectroscopy [7,8], but these methods provide limited accuracy and can only be used for a small range of polymer species. In addition to experimental methods, SP can also be calculated by empirical approaches and computer simulations. A group-contribution method (GC) is an empirical approach, which uses the sum of the contributions of structural and functional groups to estimate polymer properties [9]. GC is easy to apply but has limited accuracy due to the use of empirical assumptions. Although new GC approaches are being developed, a general model that can cover a wide range of polymer species and polymer properties is not available [10]. Atomistic simulations employing force fields and interatomic potential functions are another tool for predicting polymer properties [5,11]. However, accurate SP predictions using atomistic simulations are computationally demanding, especially for polymers and compounds with complex structures [12]. In this regard, data-driven approaches based on machine learning (ML) models have become an appealing alternative to simple empirical approaches and atomistic simulations. ML models have been developed to predict the physical or chemical properties of materials with good accuracy, including solubility parameters [11]. However, the predictive power of ML models depends heavily on the availability of accurate and consistent target data covering a wide structural and compositional range, as well as unbiased descriptors, i.e., readily available observables that can be linked to the target property [13]. Such observables can be derived, e.g., from experimental data or from quantum chemical or atomistic calculations [13,14]. It is difficult to obtain relevant experimental data on target properties and descriptors for predicting the SP of polymers. In the case of descriptors, one approach is to use the features derived for monomer molecules [13]. However, polymer properties may be fundamentally different from those of the monomer. Therefore, small organic molecules that are structurally similar to the repeating element (RE) of the polymer may be a better choice. For a given polymer, different molecules can be identified that represent REs, as shown in Figure 1 for polyethylene glycol (PEG). Both ethylene glycol and ethanol, as polar molecules forming strong hydrogen bonds, are poor choices for deriving molecular descriptors for ML models for PEG.

Figure 1

Polyethylene glycol (PEG) and different choices of small molecules with a structural motif of the repeating element of PEG.

In this work, descriptors derived from the atomic structure and quantum chemical calculations of small molecules as potential polymer REs are evaluated for ML models of the polymer SP. Since reliable cohesive energy density and SP data for polymers are difficult to obtain in experiments and simulations, a surrogate target property is used to evaluate the descriptors, namely, the experimental heat of vaporization of the small molecules. For low-molecular-weight compounds, can usually be determined with good accuracy [15]. Subsequently, the relationship between of the polymer RE and the available SP of the polymers is investigated.

2. Method

2.1. Molecular Datasets

ML models for predicting were trained and tested on a dataset of small organic molecules including hydrocarbons, alcohols, acids, amines, ketones, aldehydes, nitriles, organic chlorides and benzene derivatives. A summary of the dataset is shown in Table 1. Figure 2 shows examples of the largest molecules used. The ML models were then applied to another dataset of organic molecules with structural similarity to REs of popular polymers to predict and correlate it with the polymer SP. This dataset is summarized in Table 2.

Table 1

Summary of the molecules included in the training dataset with experimental (see Supplementary Materials, Tables S1 and S6).

Type	Formula	Size n	Number of Molecules
hydrocarbons	C_nH_2n+2	1–10	10
acids	C_nH_2n+1COOH	0–8	9
alcohols	C_nH_2n+1OH	1–9	9
ketones	C_nH_2nO/C₆H₅COCH₃	3–7	6
amines	C_nH_2n+1NH₂	1–6	5
aldehydes	C_n−1H_2n−1CHO/C₆H₅CHO	3–6	5
nitriles	C_nH_2n+1CN	1, 3–6	5
organic chlorides	C_nH_2n+1Cl	1, 3–6	5
benzene derivatives	C₆H₆/C₆H₅OCH₃/C₆H₅OCH₂CH₃/C₆H₅CH₂OCH₃/C₆H₅C_nH_2n+1	1, 2, 4	7

Figure 2

The largest molecule of each type used in the training dataset: decane (C10H22), nonanoic acid (C8H17COOH), nonanol (C9H19OH), 2-heptanone (C7H14O), hexylamine (C6H13NH2), hexanal (C5H11CHO), hexanenitrile (C5H11CN), chloropentane (C5H11Cl) and butylbenzene (C6H5C4H9).

Table 2

Summary of polymer repeating elements (REs) included in the validation set with experimental (see Supplementary Materials, Tables S2 and S7).

Polymer	Formula	RE	Formula of RE
poly(acrylic acid)	(C₃H₄O2)_n	propanoic acid	CH₃CH₂COOH
poly(allyl cyanide)	(C₄H₅N)_n	butanenitrile	CH₃CH₂CH2CN
polyacrylonitrile	(C₃H₃N)_n	propanenitrile	CH₃CH₂CN
polybutylene	(C₄H₈)_n	butane	CH₃CH₂CH₂CH₃
polyethylene (HDPE)	(C₂H₄)_n	ethane	CH₃CH₃
poly(ethylene glycol)	(C₂H₄O)_n	dimethyl ether	CH₃OCH₃
cis-1,4-polyisoprene	(C₅H₈)_n	2-methyl-2-butene	CH₃CHC(CH₃)₂
polyisobutene	(C₄H₈)_n	isobutane	(CH₃)₂CHCH₃
polymethacrylonitrile	(C₄H₅N)_n	isobutyronitrile	(CH₃)₂CHCN
poly(methyl methacrylate)	(C₅H₈O₂)_n	methyl butyrate	CH₃CH₂CH₂COOCH₃
polypropylene	(C₃H₆)_n	propane	CH₃CH₂CH₃
polystyrene	(C₈H₈)_n	ethylbenzene	C₆H₅C₂H₅
poly(vinyl alcohol)	(C₂H₄O)_n	ethanol	CH₃CH₂OH
poly(vinyl acetate)	(C₄H₆O2)_n	ethyl acetate	CH₃COOCH₂CH₃
poly(vinyl chloride)	(C₂H₃Cl)_n	chloroethane	CH₃CH₂Cl
poly(vinyl ethyl ether)	(C₄H₆O)_n	diethyl ether	CH₃CH₂OCH₂CH₃

The organic molecules in both datasets cover a wide range of values, from 8.19 kJ/mol (methane) to 69.00 kJ/mol (heptanoic acid), and represent different chemical structures. The experimental values of for all molecules, measured around a normal boiling point, were collected from the literature (see Supplementary Materials).

2.2. Computational Details

All density functional theory calculations were performed as a gas phase using the Turbomole program package [16]. The Becke 3-parameter Lee–Yang–Parr (B3-LYP) [17] exchange–correlation functional was employed, along with triple zeta valence plus polarization (def2-TZVP) [18] basis sets and Grimme dispersion correction (DFT-D3) [19]. The geometry convergence criteria for DFT calculations were hartree for the energy change and hartree/bohr for the gradient norm. Count descriptors and topological descriptors were calculated with the PaDEL-Descriptor program [20]. Python 3 with the Scikit-learn package was used for building all machine learning models [21].

2.3. Molecular Descriptors

In the current work, four quantum chemical descriptors were obtained using DFT calculations: atomization energy (AE), quadrupole moment (QM), chemical hardness and electronegativity . There are different definitions for the quadrupole moment [22,23]. In the present work, the quadrupole moment was defined as the second moment of charge [23], and QM was taken as where , and are diagonal elements of the second moment of the charge tensor. Chemical hardness and electronegativity are chemical reactivity descriptors that were applied in an artificial neural network for predicting solvation energies [24]. They are defined as and where and denote the energies of the highest occupied (HOMO) and lowest unoccupied molecular orbitals (LUMO), respectively. The remaining descriptors, such as number of aromatic bonds (nAromBond) and number of heavy atoms (nHeavyAtom), were generated using the PaDEL-Descriptor [20] program based on the atomic structures of molecules. The descriptors were obtained using exactly the same method for both datasets (Table 1 and Table 2). The full specification of the descriptors is given in the Supplementary Materials.

2.4. Machine Learning Models

Two supervised machine learning models were used: multilinear regression (MLR) and kernel ridge regression (KRR) [25]. MLR is the simplest ML model using the least square method and has been widely applied in data analysis. The MLR model can be represented as a linear combination of all descriptors where is the coefficient of each descriptor , and is the intercept. The training of the MLR model involves the determination of the best and . All hyperparameters used for MLR training used the default values implemented in the Scikit-learn package. KRR is a combination of the kernel function and ridge regression, which is an improvement on the ordinary linear regression method [26,27]. There are several kernel functions available for different tasks, and in the present work, the polynomial kernel function was applied where and are descriptors, and hyperparameters c and d are the soft margin constant and degree of the polynomial kernel, respectively. The accuracy and performance of the model usually depend on the choice of hyperparameters. Since the dataset was relatively small, changes in parameters other than c and d had little effect on the model accuracy and were therefore set to constant values (alpha (regularization strength) = 0.001 and gamma = none). In the current work, c and d in Equation (7) were determined using the grid search function of Scikit-learn. Based on the grid search results, the value of c had little effect on the model and was finally set to 1. The models with d = 1 and d = 2 showed similar performance, and both models were retained for further study. Changes in parameters other than c and d had little effect on the model accuracy and were therefore left at the default values implemented in the Scikit-learn package. Due to the limited size of the dataset (61 molecules in total), the leave-one-out cross-validation (LOOCV) method was used in the current work, aiming to make optimal use of each sample and to obtain a more justified model. LOOCV is an extreme case of cross-validation, in which only one sample is selected for testing in each cycle, and the other samples are used to train the model until all samples have been selected once. The final model is optimized by averaging the LOOCV results. MLR and KRR models were trained with the same dataset, and LOOCV was applied for all models. After training and LOOCV, all models were used to predict of polymer REs (16 molecules in Table 2), and the performance of all models was analyzed [28] using root-mean-square error (RMSE) mean absolute error (MAE) and the average of relative error (ARE) where and are the reference and predicted values, respectively. In addition, the coefficient of determination (R2) was used to describe the proportion of variability in a dataset that can be explained by the model [29].

3. Discussion

3.1. Selection of Molecular Descriptors

Molecular descriptors were manually selected and filtered by analyzing multicollinearity based on correlation coefficients. For this, descriptors were selected in addition to the quantum chemical descriptors that showed low multicollinearity (correlation coefficient within ±0.75) and a high degree of correlation with . The 15 descriptors finally selected are shown in Figure 3.

Figure 3

Correlation coefficients among 15 descriptors. AE: atomization energy, QM: quadrupole moment, nAromBond: number of aromatic bonds; nHeavyAtom: number of heavy atoms (all but hydrogen); SsOH: sum of (-OH) E-States; SssO: sum of (-O-) E-States; nAcid: number of acidic groups; SHdsCH: sum of (=CH-) E-States; SsNH2: sum of (-NH2) E-States; SsCH3: sum of (-CH3) E-States; SHBa: sum of E-States for hydrogen bond acceptors; SHBd: sum of E-States for hydrogen bond donors; SssCH2: sum of (-CH2) E-States (see Supplementary Materials); : chemical hardness; : electronegativity.

Figure 3 shows that there is only weak correlation among most descriptors, which can reduce the risk of collinearity problems [30]. Reducing redundant and irrelevant descriptors also lowers the cost of training and reduces the possibility of an overfitting problem [14,31].

3.2. Predictions of for Small Organic Molecules

The final MLR model for predicting (in kJ/mol) is given as This model does not contain any unreasonably small or large factors for descriptors, which indicates that there are no irrelevant or redundant descriptors. Figure 4 shows that MLR performed well for the training set of small molecules and the LOOCV, according to the R2 score and other metrics. ARE for training (0.071) and LOOCV (0.105) showed the same trend as the other metrics. The MLR model showed good accuracy for predicting of molecules, with a maximum deviation of 12.43 kJ/mol for ethanoic acid. However, there are large disparities in the values of for ethanoic acid across the literature (from 23.7 (at 391.1 K) to 42 (at 305 K) kJ/mol) [32,33]. The overall deviation is within experimental accuracy. For the LOOCV, both the RMSE of 5.291 kJ/mol and MAE of 3.607 kJ/mol are within ranges indicative of good accuracy.

Figure 4

Performance of the MLR model for predictions of small molecules (ARE for the training set: 0.072 and for LOOCV: 0.105). , MAE and RMSE in kJ/mol.

The performance comparison of the final KRR (d = 1) and KRR (d = 2) models is shown in Figure 5.

Figure 5

Performance of the KRR models for predictions of small molecules: (a) d = 1 (ARE for the training set: 0.072 and for LOOCV: 0.106), (b) d = 2 (ARE for the training set: 0.026 and for LOOCV: 0.184). , MAE and RMSE in kJ/mol.

Compared to the MLR model, the KRR model (d = 1) did not perform better in training, but all metrics had a small lead in cross-validation, which showed slightly better stability. KRR (d = 2) performed best during training but was the worst in LOOCV, and this case was most likely due to the overfitting. Considering the size of the datasets used in the current work, high-scoring ML models trained with small datasets can often suffer from overfitting.

3.3. Predictions of for Polymer Repeating Elements

All three models were applied to predict of molecules representing polymer REs. Table 3 and Figure 6 show that the MLR and KRR (d = 1) models provided the best accuracy. The KRR (d = 2) model failed to predict of polymer RE, and the much larger error of the KRR (d = 2) suggests that the model was overfitted. As mentioned, KRR algorithms do not offer advantages on small datasets.

Table 3

Performance comparison of MLR and KRR models (MAE and RMSE in kJ/mol).

	Training Set (LOOCV)				Polymer REs
	R²	MAE	RMSE	ARE	R²	MAE	RMSE	ARE
MLR	0.805	3.607	5.291	0.105	0.819	3.533	4.348	0.118
KRR (d=1)	0.806	3.597	5.276	0.106	0.821	3.540	4.348	0.118
KRR (d=2)	0.328	2.520	15.661	0.184	0.311	23.472	36.536	0.820

Figure 6

predictions of polymer RE by MLR and KRR (d = 1), ARE for MLR: 0.118 and for KRR (d = 1): 0.118. , MAE and RMSE in kJ/mol.

Figure 4, Figure 5 and Figure 6 demonstrate that the MLR model showed slightly worse performance than the two KRR models during training and cross-validation, but the MAE of MLR for polymer RE was better than that of KRR (d = 1). Therefore, the MLR model and the KRR model (d = 1) in the current work have better extrapolation ability than the KRR (d = 2) models. However, the KRR algorithm with higher d could still yield better results for a larger dataset with more complex structures and chemical compositions.

3.4. Hildebrand Solubility Parameter of Polymers

Our results show that the heat of vaporization of small molecules and polymeric REs, and thus their SPs, can be predicted with good accuracy using the MLR and KRR (d = 1) models. The question now is how well the Hildebrand SP of RE correlates with the SP of the corresponding polymers. For this, Hildebrand SPs of polymers were collected from the CROW polymer database [34] with recommended values, and SPs of REs were calculated from MLR-predicted (see Supplementary Materials). Figure 7 shows the correlation of Hildebrand SPs between polymers and REs. The linear model yields an R2 value of 0.855.

Figure 7

Correlation of Hildebrandt SP between polymers and REs. All values in MPa1/2. Linear fit model: .

There are several factors that can affect the accuracy of Hildebrand SP predictions for polymers. First, the experimental values of SPs for polymers can only be determined indirectly, and the accuracy of such values is essentially indeterminate. Second, the SPs are determined not only by the internal structure of the polymer chains, reflected here in the descriptors derived from the polymer RE, but also by factors such as the degree of polymerization, polydispersity, and the nature of the end groups. Such factors cannot be determined from the properties of REs alone and must be derived from experimental data. How well the two descriptors, chemical hardness and electronegativity, actually help in the prediction of solubility parameters needs to be further investigated, as intuitively, the association between the two and polymer solubility is not strong. In addition, larger chemical structures, such as oligomers with several repeating units, may provide more information about inter- and intramolecular interactions and can improve the accuracy of machine learning models. Simulations of such structures are obviously less computationally expensive than simulations of polymers, but finding suitable descriptors may still be a challenge. This is also one of the pathways for future research studies.

4. Conclusions

In this work, descriptors derived from atomic structure and quantum chemical calculations for small molecules as potential polymer repeating elements were evaluated for machine learning models to predict the Hildebrand solubility parameters of the corresponding polymers. Since reliable cohesive energy density data and solubility parameters for polymers are difficult to obtain, the experimental heat of vaporization of small molecules was used as a proxy property to evaluate the descriptors. The multilinear and kernel ridge regression model (with polynomial kernel degree = 1) showed good and very similar performance in training, cross-validation and the prediction of molecules representing polymer repeating elements. The kernel ridge regression model (degree = 2) was strongly overfitted, which was revealed by its poor performance in cross-validation and prediction. The Hildebrand solubility parameters derived from the multilinear regression model for the of polymer repeating elements showed good correlation with the solubility parameters of the corresponding polymers collected from the CROW polymer database. However, atomistic descriptors derived from polymer repeating elements only reflect the internal structure of the polymer chains. More accurate models for predicting the Hildebrand solubility parameters of polymers must take into account additional relevant factors, such as the degree of polymerization, polydispersity and the nature of the polymer end groups. Such factors cannot be determined from the properties of the repeating elements of the polymer alone and must be derived from experimental data.

15 in total

1. Selection of excipients for melt extrusion with two poorly water-soluble drugs by solubility parameter calculation and thermal analysis.

Authors: A Forster; J Hempenstall; I Tucker; T Rades
Journal: Int J Pharm Date: 2001-09-11 Impact factor: 5.875

2. Quantum-Chemical Descriptors in QSAR/QSPR Studies.

Authors: Mati Karelson; Victor S. Lobanov; Alan R. Katritzky
Journal: Chem Rev Date: 1996-05-09 Impact factor: 60.622

Review 3. Polymer architecture and drug delivery.

Authors: Li Yan Qiu; You Han Bae
Journal: Pharm Res Date: 2006-01-11 Impact factor: 4.200

4. Quantum mechanically based estimation of perturbed-chain polar statistical associating fluid theory parameters for analyzing their physical significance and predicting properties.

Authors: Nguyen Van Nhu; Mahendra Singh; Kai Leonhard
Journal: J Phys Chem B Date: 2008-04-16 Impact factor: 2.991

5. Effect of the damping function in dispersion corrected density functional theory.

Authors: Stefan Grimme; Stephan Ehrlich; Lars Goerigk
Journal: J Comput Chem Date: 2011-03-01 Impact factor: 3.376

6. Empirical search for factors affecting mean particle size of PLGA microspheres containing macromolecular drugs.

Authors: Jakub Szlęk; Adam Pacławski; Raymond Lau; Renata Jachowicz; Pezhman Kazemi; Aleksander Mendyk
Journal: Comput Methods Programs Biomed Date: 2016-07-06 Impact factor: 5.428

7. Hildebrand and Hansen solubility parameters from molecular dynamics with applications to electronic nose polymer sensors.

Authors: M Belmares; M Blanco; W A Goddard; R B Ross; G Caldwell; S-H Chou; J Pham; P M Olofson; Cristina Thomas
Journal: J Comput Chem Date: 2004-11-30 Impact factor: 3.376

8. Thermodynamic compatibility of actives encapsulated into PEG-PLA nanoparticles: In Silico predictions and experimental verification.

Authors: Andreas Erlebach; Timm Ott; Christoph Otzen; Stephanie Schubert; Justyna Czaplewska; Ulrich S Schubert; Marek Sierka
Journal: J Comput Chem Date: 2016-07-18 Impact factor: 3.376

Review 9. Machine learning for molecular and materials science.

Authors: Keith T Butler; Daniel W Davies; Hugh Cartwright; Olexandr Isayev; Aron Walsh
Journal: Nature Date: 2018-07-25 Impact factor: 49.962