Abstract
Registration, evaluation, and authorization of chemicals (REACH), the regulation of chemicals in use, requires the characterization and reporting of the physicochemical properties of compounds. To cope with the financial burden of the experiments, the use of computational models is permitted for the prediction of properties. Although a number of physicochemical property prediction models have been developed, their applicability domain is limited to organic molecules, since most available data concern organic molecules and most molecular descriptors can be calculated only for organic molecules. Prediction models developed for inorganic compounds were intended to predict endpoints relevant to novel material design. Therefore, no models were available for predicting endpoints of inorganic compounds that are significant from a regulatory perspective. In this study, boiling point, water solubility, melting point, and pyrolysis point prediction models were developed for inorganic compounds based on their composition. The electron configuration of each element in the molecule was used as a descriptor. The dataset covered a wide range of endpoint values and compounds containing diverse elements. The performance of the models was measured using R², mean absolute error, and Spearman's correlation coefficient, and indicated good prediction accuracy for continuous endpoints and good prioritization of inorganic compounds.
This journal is © The Royal Society of Chemistry.
Year: 2020 PMID: 35515036 PMCID: PMC9056678 DOI: 10.1039/d0ra05873d
Source DB: PubMed Journal: RSC Adv ISSN: 2046-2069 Impact factor: 4.036
Fig. 1 Data distributions of the training set and test set, plotted as molecular weight against the target endpoint (A, C, E and G). The element composition of the training set and test set is shown on the periodic table (B, D, F and H). Each element is colored by group: three if the element is found in both the training set and the test set, two if it is found in the training set alone, and one if it is found in the test set alone. The color bar next to the periodic table indicates the color for each group.
n-Fold cross-validation-based hyperparameter optimization resultsᵃ
| Endpoint | Model | Hyperparameters | R² | SpeaR | MAE | MAE/the range | Endpoint range |
|---|---|---|---|---|---|---|---|
| BP | ANN | hid¹*: (97, bN), act²*: tanh, opt³*: Adam, | 0.73 ± 0.13 | 0.8 ± 0.06 | 383.28 ± 88.9 | 6.54% | −268 °C to 5590 °C |
| BP | ANN | | | | | 5.50% | −268 °C to 5590 °C |
| BP | ANN | hid: (97, bN, 24, bN, 24, bN), act: relu, opt: RMSprop, | 0.83 ± 0.05 | 0.89 ± 0.04 | 298.93 ± 44.91 | 5.10% | −268 °C to 5590 °C |
| BP | SVM | | 0.69 ± 0.08 | 0.81 ± 0.04 | 405.93 ± 90.53 | 6.93% | −268 °C to 5590 °C |
| BP | RFR | tree⁹*: 3000 | 0.72 ± 0.15 | 0.79 ± 0.07 | 342.66 ± 80.55 | 5.85% | −268 °C to 5590 °C |
| log | ANN | | | | | 9.39% | −12.95 to 1.75 |
| log | ANN | hid: (46, 23), act: tanh, opt: Adam, | 0.64 ± 0.04 | 0.79 ± 0.02 | 1.24 ± 0.07 | 8.44% | −12.95 to 1.75 |
| log | SVM | | 0.52 ± 0.06 | 0.74 ± 0.04 | 1.48 ± 0.13 | 10.07% | −12.95 to 1.75 |
| log | RFR | tree: 10 000 | 0.27 ± 0.05 | 0.63 ± 0.02 | 1.7 ± 0.04 | 11.56% | −12.95 to 1.75 |
| MP | ANN | | | | | 4.87% | −259.16 °C to 3880 °C |
| MP | ANN | hid: (105, bN, 52, bN), act: sigmoid, opt: Adam, | 0.87 ± 0.01 | 0.93 ± 0.00 | 180.46 ± 13.01 | 4.36% | −259.16 °C to 3880 °C |
| MP | SVM | | 0.79 ± 0.01 | 0.89 ± 0.01 | 231.19 ± 9.87 | 5.59% | −259.16 °C to 3880 °C |
| MP | RFR | tree: 10 000 | 0.66 ± 0.03 | 0.80 ± 0.02 | 284.67 ± 13.18 | 6.88% | −259.16 °C to 3880 °C |
| PP | ANN | | | | | 9.89% | −185 °C to 1980 °C |
| PP | ANN | hid: (93, bN, 46, bN), act: relu, opt: Adam, | 0.28 ± 0.12 | 0.58 ± 0.08 | 209.0 ± 28.38 | 9.65% | −185 °C to 1980 °C |
| PP | SVM | | 0.23 ± 0.07 | 0.57 ± 0.1 | 212.45 ± 19.84 | 9.81% | −185 °C to 1980 °C |
| PP | RFR | tree: 300 | −0.34 ± 0.52 | 0.38 ± 0.1 | 262.59 ± 34.43 | 12.13% | −185 °C to 1980 °C |
ᵃ ¹*hid: hidden layer architecture, giving the number of hidden nodes in each layer; ²*act: activation function; ³*opt: optimizer; ⁴*lambda (λ): regularization parameter in ANN; ⁵*dr: dropout ratio; ⁶*gamma (γ): normalization option; ⁷*C: regularization parameter in SVM; ⁸*epsilon (ε): parameter for the no-penalty region; ⁹*tree: number of trees used in the random forest model. Hyperparameters in bold were considered the best-performing option for each model.
Prediction accuracy comparison between different descriptors
| Endpoint | Electron configuration | | | | Element composition | | | | Matminer | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Num. | R² | SpeaR | MAE | Num. | R² | SpeaR | MAE | Num. | R² | SpeaR | MAE |
| BP | 97 | 0.8 ± 0.1 | 0.89 ± 0.05 | 322.36 ± 74.68 | 91 | 0.62 ± 0.09 | 0.83 ± 0.06 | 411.57 ± 72.31 | 132 | −1.69 ± 3.87 | 0.24 ± 0.13 | 876.02 ± 256.19 |
| log | 93 | 0.59 ± 0.09 | 0.79 ± 0.04 | 1.38 ± 0.1 | 77 | 0.46 ± 0.17 | 0.76 ± 0.03 | 1.44 ± 0.12 | 151 | −0.18 ± 0.21 | 0.19 ± 0.03 | 2.45 ± 0.12 |
| MP | 105 | 0.81 ± 0.09 | 0.92 ± 0.01 | 201.52 ± 21.78 | 102 | 0.72 ± 0.07 | 0.89 ± 0.01 | 249.50 ± 17.42 | 126 | 0.42 ± 0.04 | 0.52 ± 0.03 | 458.66 ± 17.38 |
| PP | 93 | 0.31 ± 0.18 | 0.55 ± 0.08 | 214.12 ± 35.12 | 75 | −0.53 ± 0.92 | 0.51 ± 0.11 | 241.34 ± 41.73 | 148 | 0.08 ± 0.08 | 0.17 ± 0.12 | 254.23 ± 21.71 |
Num.: number of features used in model development after removing features whose standard deviation was 0.
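The feature-removal step mentioned in the footnote (dropping features whose standard deviation is 0) can be sketched as follows. The function name and the toy matrix are hypothetical; the check relies on the fact that a column has zero standard deviation exactly when all of its values are equal.

```python
def drop_constant_features(rows):
    """Remove columns whose values never vary across samples.

    rows: list of equal-length feature lists, one per compound.
    Returns the reduced rows and the indices of the columns that were kept.
    """
    n_features = len(rows[0])
    keep = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        # Standard deviation is 0 exactly when min and max coincide.
        if max(column) != min(column):
            keep.append(j)
    reduced = [[row[j] for j in keep] for row in rows]
    return reduced, keep

# Toy descriptor matrix: columns 0 and 1 are constant and carry no information.
X = [[1, 0, 2],
     [1, 0, 3],
     [1, 0, 5]]
X_reduced, kept = drop_constant_features(X)
print(len(kept), "feature(s) kept:", kept)
```

The count of kept columns plays the role of the "Num." entries in the table, which differ per endpoint because each endpoint has its own dataset.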
Performance of the selected artificial neural network models on the training set and external test set
| Endpoint | Epoch size | R² | | SpeaR | | MAE | | The range of endpoint | | MAE/the range | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Train | Test | Train | Test | Train | Test | Train | Test | Train | Test |
| BP | 500 | 0.96 | 0.88 | 0.95 | 0.94 | 178.7 | 222.65 | −268 to 5590 | −188.11 to 4785 | 3.05% | 4.48% |
| log | 200 | 0.73 | 0.63 | 0.85 | 0.83 | 1.16 | 1.26 | −12.95 to 1.75 | −12.00 to 1.49 | 7.89% | 9.16% |
| MP | 500 | 0.93 | 0.89 | 0.95 | 0.93 | 136.41 | 170.39 | −259.16 to 3880 | −219.67 to 3414 | 3.30% | 4.69% |
| PP | 500 | 0.61 | 0.66 | 0.68 | 0.76 | 167.29 | 147.55 | −185 to 1980 | −40 to 1870 | 7.73% | 7.73% |
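The "MAE/the range" column appears to be the MAE divided by the width of the endpoint range (maximum minus minimum), expressed as a percentage. That reading is an assumption, but the short check below reproduces the BP, MP, and PP entries from the rounded values in the table; the function name is hypothetical, and because the tabulated inputs are rounded, small differences are possible.

```python
def mae_over_range(mae, low, high):
    """MAE as a percentage of the endpoint range (high - low)."""
    return 100.0 * mae / (high - low)

# (MAE, range minimum, range maximum) copied from the table above.
entries = {
    "BP train": (178.7, -268.0, 5590.0),
    "BP test": (222.65, -188.11, 4785.0),
    "MP train": (136.41, -259.16, 3880.0),
    "MP test": (170.39, -219.67, 3414.0),
    "PP train": (167.29, -185.0, 1980.0),
    "PP test": (147.55, -40.0, 1870.0),
}

for name, (mae, low, high) in entries.items():
    print(f"{name}: {mae_over_range(mae, low, high):.2f}%")
```

Normalizing by the range makes errors comparable across endpoints whose scales differ by orders of magnitude, such as a boiling point in degrees Celsius and a logarithmic solubility.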
Fig. 2 Prediction accuracy of the models, visualized by comparing prediction results on the training set and test set (A, C, E and G). Absolute errors averaged over ranges of each property are also shown (B, D, F and H).
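Averaging absolute errors over ranges of a property, as in panels B, D, F and H, amounts to binning compounds by their measured value and computing the mean absolute error per bin. The sketch below shows one way to do this; the bin edges, the data, and the function name are all hypothetical, and the binning used for the figure is not given here.

```python
def binned_mae(y_true, y_pred, edges):
    """Mean absolute error per bin of the true value.

    edges: ascending boundaries; bin i covers [edges[i], edges[i + 1]).
    Returns one value per bin, or None for a bin with no samples.
    """
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, p in zip(y_true, y_pred):
        for i in range(n_bins):
            if edges[i] <= t < edges[i + 1]:
                sums[i] += abs(t - p)
                counts[i] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]

# Invented measured and predicted values, for illustration only.
measured = [100.0, 200.0, 1200.0, 1800.0]
predicted = [150.0, 180.0, 1000.0, 2000.0]
print(binned_mae(measured, predicted, edges=[0.0, 1000.0, 2000.0]))
```

A per-range view of this kind shows whether errors concentrate at the extremes of an endpoint, which a single overall MAE would hide.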
Fig. 3 Input PCAs show the feature space of the electron configuration descriptors (A, E, I and M). Layer 1 PCAs are computed from the linear combinations of the electron configuration in the first hidden layer (B, F, J and N). Activation PCAs are computed from those linear combinations after the activation function is applied (C, G, K and O). Batch normalization PCAs are computed from the output of the batch normalization layer (D, H, L and P). Reading the plots from the input PCA to the batch normalization PCA visualizes the change in feature space layer by layer. The color gradient represents the numerical value of the property: bright yellow marks the highest values and dark purple the lowest.
Fig. 4 Procedure for calculating the electron configuration bit vectors for inorganic compounds, taking Al₂(MoO₄)₃ as an example.
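The figure itself is not reproduced here, so the exact encoding is not known from this text. As a rough sketch of the general idea only, the code below parses Al2(MoO4)3 into element counts, turns each element's ground-state electron configuration into a bit vector with one bit per electron slot, and sums the bits weighted by the element counts. The slot-per-electron encoding, the weighted sum, and all function names are assumptions rather than the authors' procedure, and the lookup table covers only the three elements in the example.

```python
import re
from collections import Counter

# Subshells in a fixed order, with capacities by subshell type.
SUBSHELLS = ["1s", "2s", "2p", "3s", "3p", "3d", "4s", "4p", "4d", "5s"]
CAPACITY = {"s": 2, "p": 6, "d": 10}

# Ground-state configurations for the elements of the example only.
CONFIG = {
    "O": {"1s": 2, "2s": 2, "2p": 4},
    "Al": {"1s": 2, "2s": 2, "2p": 6, "3s": 2, "3p": 1},
    "Mo": {"1s": 2, "2s": 2, "2p": 6, "3s": 2, "3p": 6, "3d": 10,
           "4s": 2, "4p": 6, "4d": 5, "5s": 1},
}

def parse_formula(formula):
    """Count atoms per element, handling nested parentheses."""
    stack = [Counter()]
    i = 0
    while i < len(formula):
        if formula[i] == "(":
            stack.append(Counter())
            i += 1
        elif formula[i] == ")":
            i += 1
            m = re.match(r"\d+", formula[i:])
            mult = int(m.group()) if m else 1
            i += len(m.group()) if m else 0
            group = stack.pop()
            for element, n in group.items():
                stack[-1][element] += n * mult
        else:
            m = re.match(r"([A-Z][a-z]?)(\d*)", formula[i:])
            if not m:
                raise ValueError(f"cannot parse {formula!r} at position {i}")
            stack[-1][m.group(1)] += int(m.group(2) or 1)
            i += len(m.group(0))
    return dict(stack[0])

def element_bits(symbol):
    """One bit per electron slot: 1 if occupied, 0 if empty (assumed encoding)."""
    bits = []
    for shell in SUBSHELLS:
        occupied = CONFIG[symbol].get(shell, 0)
        capacity = CAPACITY[shell[-1]]
        bits += [1] * occupied + [0] * (capacity - occupied)
    return bits

def compound_vector(formula):
    """Sum of element bit vectors weighted by atom counts (assumed combination)."""
    n_bits = sum(CAPACITY[shell[-1]] for shell in SUBSHELLS)
    total = [0] * n_bits
    for element, n in parse_formula(formula).items():
        for i, bit in enumerate(element_bits(element)):
            total[i] += n * bit
    return total

print(parse_formula("Al2(MoO4)3"))
print(compound_vector("Al2(MoO4)3"))
```

A composition-only descriptor of this kind needs no structural information, which is what lets it cover inorganic compounds that organic-oriented molecular descriptors cannot handle.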