Literature DB >> 34124451

Solubility Prediction from Molecular Properties and Analytical Data Using an In-phase Deep Neural Network (Ip-DNN).

Atsushi Kurotani¹, Toshifumi Kakiuchi², Jun Kikuchi^1,3,4.

Abstract

Materials informatics is an emerging field that allows us to predict the properties of materials and has been applied in various research and development fields, such as materials science. In particular, solubility factors such as the Hansen and Hildebrand solubility parameters (HSPs and SP, respectively) and Log P are important values for understanding the physical properties of various substances. In this study, we succeeded at establishing a solubility prediction tool using a unique machine learning method called the in-phase deep neural network (ip-DNN), which starts exclusively from the analytical input data (e.g., NMR information, refractive index, and density) to predict solubility by predicting intermediate elements, such as molecular components and molecular descriptors, in the multiple-step method. For improving the level of accuracy of the prediction, intermediate regression models were employed when performing in-phase machine learning. In addition, we developed a website dedicated to the established solubility prediction method, which is freely available at "http://dmar.riken.jp/matsolca/".

Entities: Chemical Disease Gene Species

Year: 2021 PMID： 34124451 PMCID： PMC8190808 DOI： 10.1021/acsomega.1c01035

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

In recent years, the application of data-driven models has been implemented in various research and development fields such as materials science, biorefinery, cosmetic chemistry, and drug discovery, especially at the industrial level. Sophisticated machine learning techniques are now becoming ubiquitous for the prediction of the physicochemical properties and engineering parameters. In materials science, the increasing availability of large amounts of data (both analytical and computational) has been recently used to advance the tools available for materials informatics (MI). It is known that a variety of indexes are commonly used to describe the solubility of substances. Among these, SP is defined by regular solution theory proposed by Hildebrand and Scott,[1] and Hildebrand solubility parameters (HSPs) are trinomial components proposed by Hansen[2] that correspond to the dispersion energy (dD), dipole interaction energy (dP), and energy of hydrogen bonding (dH) between molecules. Log S is the base 10 logarithm of the solubility S [mol/L] in water. Log P is the base 10 logarithm of the octanol–water partition coefficient that indicates octanol solubility and therefore lipophilicity. In particular, they are needed in various research and development fields where solubility information of substances such as materials, pharmaceuticals, and food is required.[3−5] The calculation of the solubility values is mainly performed using the conventional group contribution method, although the machine learning method has also been attracting attention in recent years owing to the artificial intelligence boom along with the development of chemoinformatics and MI. In addition, simulation methods are often used as complementary techniques to the standard calculation of the solubility values.[6,7] The calculated solubility values by the group contribution method are based on the aggregation energy of the molecular structures (atoms, functional groups, etc.).[8] The group contribution method was developed in an early stage[9,10] and has been improved in recent years.[11−13] In addition, the application of the predicted Log S values to the group contribution method for drug delivery has also been reported.[14] The determination of the solubility values by machine learning methods relies on the prediction of these values by training known structural and physical properties on information related to the solubility as descriptors. As an example of prediction of Log S using machine learning, a report described how to calculate the desired value using a random forest to train the molecular descriptors of the CDK tool,[15] which is a chemoinformatic library in the Java language.[16] Another study predicted the Log S, Log P, melting point, and toxicity with a convolutional neural network (CNN) using the fingerprint of structural information as training data with SMILES strings.[17] Moreover, the prediction of SP, glass transition point, density, and so forth was performed by the Gaussian process regression (GPR) to train the molecular structure, quantitative structure–property relationship (QSPR)[18] descriptors that were obtained from the RDKIT tool,[19] and molecular morphological information, such as the side chain, distance between rings, and so forth.[20] HSPs were predicted using an improved MARS (multivariate adaptive regression splines[21]) method to train the QSPR molecular descriptors with the PaDEL tool[22] using SMILES strings.[23] HSPs were also predicted using GPR that trained the physical properties of compounds, such as the surface area, volume, and so forth, from molecular simulation data using SMILES string information.[24] As mentioned above, solubility-related predictions have been reported using various training data. However, the input data in these predicting methods require structure-related information, such as atoms, rings, bonds, functional groups, and molecular descriptors. The molecular descriptors can be obtained using chemoinformatic tools, such as RDKIT, CDK, and PaDEL, which demand at least one of the SMILES, SMARTS, sdf format, mol format, and so on. Therefore, when predicting the solubility of unknown substances with the abovementioned methods, structure-related information is required to be at least at the 2D level as input data. In contrast, analytical data, such as NMR spectra, offer an enormous amount of information regarding the local structure and functional groups.[25,26] In particular, 1H and 13C chemical shifts can be used as information to predict the local structure or the entire molecular structure with the aid of chemoinformatics, even in the case of the primary stage analysis of a complex mixture. Such NMR spectral information along with the refractive index and density can potentially be obtained as primary-stage analytical data.[27−36] Therefore, we developed a special solubility prediction tool using an in-phase DNN method, which is based exclusively on analytical data as input and allows us to improve the accuracy by regressing molecular information, including molecular composition and molecular descriptors, as intermediate data in a stepwise fashion (Figure method3 and Figure S1b). In addition, we developed a web tool (http://dmar.riken.jp/matsolca/) to calculate mainly HSPs, SP, and Log P from the analysis data, including the NMR information, refractive index, and density, as input data. In addition, we confirmed the applicability of this prediction tool to polymer data whenever analytical data of a polymer are available. We believe that this tool may accelerate the creation of novel designs and development of new materials since it allows us to predict the solubility from analytical data without the need for obtaining complete structural data.

Figure 1

Solubility prediction methods from analytical data. Method1 allows us to predict the solubility values by simply starting from analytical data (shown as “Anal. Data”) as input data using DNN. “RI” in “Anal. Data” means the refractive index. The numbers in parentheses show the number of the attributes for machine learning. Method2 is a 2-step DNN prediction method: In the first step, the molecular compositions (shown as “Mol. Comp.”) and molecular descriptors (shown as “Mol. Disc.”) are predicted from analytical data and are selected according to a defined threshold. Here, the molecular descriptors mean the data from RDKIT’s descriptors. In the second step, the solubility values are predicted from the analytical data and selected molecular properties. Method3 is a 3-step DNN prediction method: In the first step, the molecular compositions are predicted from analytical data and selected by a defined threshold. In the second step, the molecular descriptors are predicted from the analytical data and selected molecular compositions. In the third step, the solubility values are predicted from the analytical data and selected molecular properties. This solubility prediction method from analytical data using intermediate molecular properties in phase was named as the “in-phase deep neural network (:ip-DNN)”, and the image is shown at the bottom.

Materials and Methods

Dataset of Compounds, Solubility, and Analytical Data

In this study, we prepared a dataset with 307 common low-molecular weight compounds. In this dataset, the number of C atoms in each compound ranged from 1 to 9, while the number of compounds containing N, S, Si, halogen (F, Cl, and Br), −OH, >CO, −CHO, −COOH, or aromatic groups was 48, 24, 4, 76, 33, 20, 11, 5, and 28, respectively (Table S1a–c). Information regarding the solubility, analytical data, molecular composition, and molecular descriptors of these compounds was collected. The solubility data included HSP, SP, and Log P values. The HSP values were obtained from the DIPPR database,[37] while the SP values were calculated from three literature HSP values according to the formula: .[38] The Log P values were derived using Crippen’s computational Log P(s) also called as MolLogP,[39] which represents one of the molecular descriptors of RDKIT and can therefore be obtained with the RDKIT tool. The analytical data included 1D 1H NMR and 1D 13C NMR spectral data and refractive index and density values. It should be noted that the NMR spectral data were collected using the SBDB (spectral database of AIST[40]) and KnowItAll spectroscopy software (Bio-Rad Laboratories, Inc. 2018 version), while the refractive index and density values were obtained with the DIPPR database.[37] To simplify 1D 1H NMR and 1D 13C NMR spectral data, we converted the information regarding the peaks in the NMR spectra to the assignment information using the table of H/C-chemical shifts in organic compounds provided by Bruker.[41] The assignment information for 1D 1H NMR and 1D 13C NMR data is shown in Table S2a,b. Finally, we prepared 60 pieces of analytical data per compound, including 25 items of 1D 1H NMR, 33 items of 1D 13C NMR, a refractive index, and a density value.

Dataset of Molecular Compositions

We collected the conceivable general 11 items of molecular composition from chemical structural formula (H, C, N, S, Si, halogens, −OH, −CHO, >CO, −COOH, and aromatics), which are shown in Table S1b,c. Si, −COOH, and aromatics are excluded because Si and −COOH represent a small amount of data for training, and aromatics is included in the molecular descriptors of RDKIT. Therefore, we selected eight items (Table S3) of molecular composition as candidates for the feature value that correspond to the number of H and C, and the existence/absence of N, S, halogens, −OH, −CHO, and >CO is used as effective training data. In addition, using the item selection of eight molecular compositions from the DNN result, N is excluded owing to the lack of evaluation values (see also the DNN 2-step method in the Results and Discussion and Table S5b). The remaining seven items are included in the cascade in 2-step and 3-step DNN predictions as intermediate models.

Dataset of Molecular Descriptors

In this study, we use the molecular descriptors calculated with RDKIT derived from SMILES strings of each compound. Among a total of 200 molecular descriptors of RDKIT,[42] we selected 20 items (Table S4: Chi0n, Chi0v, Chi1v, HallKierAlpha, Kappa3, MaxPartialCharge, MinPartialCharge, MolMR, PEOE_VSA1, SlogP_VSA12, SMR_VSA5, SMR_VSA10, TPSA, VSA_EState9, NHOHCount, NumAromaticRings, NumHAcceptors, NumHDonors, NumHeteroatoms, and RingCount) as a candidate for the feature value, which are the top 70% between the highest and bottom level of the regression score, by calculating important factors[43] for the dD/dH/dP/SP regression model with random forest using the 200 molecular descriptors (Figure S2a–d). These items are generally used as either training data or objective variables for prediction. For the six molecular compositions based on −CHO, >CO, −OH, halogens, S, and N and the six molecular descriptors of RDKIT, that is, NHOHCount, NumAromaticRings, NumHAcceptors, NumHDonors, NumHeteroatoms and RingCount, we did not use their number but rather their presence or absence due to the fact that only few data correspond to more numbers higher than 1. This method based on the presence or absence of these items is indicated as presence/absence prediction, while the other is called as numerical prediction. In addition, using the item selection of 20 molecular descriptors considered from the DNN result, six items (PEOE_VSA1, Chi0v, Chi1v, MolMR, TPSA, and Kappa3) are excluded owing to the lack of evaluation values (see also the DNN 2-step method in the Results and Discussion and Table S5a). The remaining 14 items are included in the cascade in 2-step and 3-step DNN predictions as intermediate models.

Adjustment of Calculation Values

It should be noted that the values of the presence/absence prediction were adjusted to 0/1, while H and C were rounded to integers from the calculated value.

Calculation with Machine Learning

The training data were normalized with total data as preparation. <span class="Chemical">DNN calculations with a fivefold cross-validation were performed using Keras-Tensorflow, which is a neural network li<span class="Chemical">brary of python programs. The order of layers of the model is as follows: an input layer, hidden layer, activated layer, hidden layer, and output layer. The setting parameters at the time of the model calculation were the number of neurons of hidden layers (30–60), number of intermediate layers (fixed to 2), dropout rate (fixed to 0.5), activated layers (sigmoid, tanh, and relu), optimizer (adam and adagrad), learning rate (0.001–0.1), number of epochs (10–200), and batch size (32–64). The optimal values of the abovementioned parameters were determined using the Bayesian optimization method. For all other parameters reported as a range of values, the optimal items were determined using the all search (grid search) method. Random forest calculations were performed with a fivefold cross-validation using the caret package, which is a machine learning package of the R program.[44] XGBoost calculations were also performed with a fivefold cross-validation using Python’s XGBRegressor library. The setting optimal parameters of XGBoost for the learning rate, max depth, subsample, and colsample by the tree were determined using the Bayesian optimization method.

Test Data and Training Data for Machine Learning

Among all 307 compounds, 31 compounds for the prediction evaluation test were randomly selected, which correspond to 1/10 of the total compounds, while the remaining 276 were used for training. In the first step of the 2-step DNN prediction method as descriptor selection, two more datasets, which are not duplicate in each set of prediction evaluation data, were prepared from the 307 compounds (Figure S3). The reason for preparing two more datasets in this case is to increase reliability in the descriptor selection and because the result of descriptor selection is used in the first step of the 3-step DNN prediction method. The evaluation of descriptor selection was confirmed with a total of three sets.

Model Performance Evaluation

For the presence/absence prediction in descriptor selection, we checked the evaluation [e.g., positive predictive value (PPV), negative predictive value (NPV), recall, and specificity] to determine whether its minimum value is more than 50% of the cutoff. For numerical prediction in descriptor selection, we checked the evaluation of R2, whether the value is more than 0.5 as the cutoff. For the model evaluation of solubility prediction, we checked R2 and root mean squared error (RMSE).

Confirming Exploration Performance for the Dataset

As dataset evaluation of exploration, we tried leave-one-cluster-out cross-validation[45] (LOCO CV), for which the test data are selected by k-means clustering, while the training data are other clusters; the test and training data are changed k times, alternatingly. In this study, the k of k-means clustering was set to 5, and a random forest algorithm with fivefold cross-validation was used. We performed LOCO CV with shuffled and normalized 276 data, which is the same as the abovementioned training data, including 60 analytical data, seven molecular compositions, and 14 molecular descriptors as all explanatory variables in our model. Then, we compared model performance with each clustered test data and 31 test data, which is the same as the abovementioned test data used in our DNN model.

Dataset of Polymer Compounds

In this study, we tested our HSP prediction models against a total of 23 polymers belonging to seven different skeleton classes, with regard to density, refractive index, 1D 1H NMR, and 1D 13C NMR data. The polymers included six polyacrylates [poly-n-butylacrylate (PBA), polymethylmethacrylate (PMMA), polyethylmethacrylate (PEMA), poly-n-butylmethacrylate (PnBMA), polymethylacrylate (PMA), and polyethylacrylate (PEA)], six polyolefins [polyethylene (PE), polypropylene (PP), polybutadiene, polyisoprene, polychloroprene, and poly-1,1-dimethylethylene], four polyethers [polyethyleneoxide (PEO), polypropylene oxide (PPO), cellulose triacetate (CTA), and polyethersulfone (PES)], two polyesters [polyethyleneterephthalate (PET) and polycaprolactone (PCL)], two polyvinyls [polyvinylacetate (PVAc) and polyvinylchloride (PVC)], two polystyrenes [polystyrene (PS) and polybutadiene-co-styrene], and polysiloxane of polydimethylsiloxane (PDMS). In particular, we tested 22 polymers except polyethylene for dD and dH and 22 polymers except cellulose triacetate for dP based on the data available in the literature. Overall, the literature HSP values were obtained from the “Polymer Handbook”[46] and “PolyInfo Database”,[47] while those for PnBMA and PET were obtained from other papers.[48,49] The analytical data relative to the refractive index and density were obtained from the “Polymer Handbook” and “PolyInfo Database”, while the spectral 1D 1H NMR and 1D 13C NMR values were derived from the “Proton and Carbon NMR Spectra of Polymers”[50] and “PolyInfo Database”.

Results and Discussion

DNN Solubility Prediction 1-Step Method Using Analytical Data as Explanatory Variables

Recently, solubility prediction tools were reported that used structural descriptors or molecular compositions and descriptors, such as RDKIT, CDK, and PaDEL, as training data.[20,23,24] Namely, the input data were based on the chemical formulas, SMILES strings, and so forth; thus, the molecular structure was mostly understood at the linear level. Therefore, this study aimed to predict the solubility (dD, dH, dP, SP, and Log P) of substances using only analytical data as input data (Figure Method1, hereinafter called as the “1-step DNN method”). Subsequently, we tried to predict dD, dH, dP, SP, and Log P using the DNN with the analytical data as training. However, the results were not sufficiently accurate ranging from 0.35 to 0.53 in R2 (Figure a).

Figure 2

Results of R2 and RMSE with test data of each prediction models. (a) Bar chart of each R2 value of solubility predictions, which are for Hansen’s solubility parameters (dD, dH, and dP), SP, and Log P, with the algorithms of the 1-step DNN method, 2-step DNN method, 3-step DNN method, 3-step random forest method, and 3-step XGBoost method. (b) Bar chart of each RMSE value.

DNN Solubility Prediction 2-Step Method

On the basis of previous studies, the prediction accuracy is expected to improve if the molecular information of a substance, such as the molecular composition and molecular descriptors, is used as training data. In this study, our aim was to predict the solubility using only analytical data as input data. Therefore, we attempted to develop a 2-step DNN solubility prediction method, which allows us to predict the solubility from analytical data and predicted intermediate data of molecular composition and molecular descriptors (Figure Method2 and Figure S1a, hereinafter called as the “2-step DNN method”). Concretely, in the first step, we predicted a total of 28 items, namely eight items of molecular composition (described in the Materials and Methods) and 20 items of selected molecular descriptors (described in the Materials and Methods), using the analytical data as training. In these predictions, we used three datasets of test and validation data. One was the same dataset used in the 1-step DNN method. The others were two additional datasets prepared to avoid duplicates in the test set data (described in the Materials and Methods sections; see also Figure S3). Then, we validated a total of three sets in order to ensure reliability for descriptor selection. According to these results, we extracted available items, for each of which the average value of R2 in the three sets was higher than 0.5 for the numerical prediction, and the average of the lowest values of the PPV (%), NPV (%), recall (%), and specificity (%) in the three sets was more than 50% for the presence/absence prediction. As a result, a total of seven items, namely, a molecular composition N item and six molecular descriptor items (PEOE_VSA1, Chi0v, Chi1v, MolMR, TPSA, and Kappa3), were excluded from the training data in the next step of the prediction since they were below the cutoff value as defined above. On the other hand, the remaining 21 items, which comprise the seven molecular composition items, that is, H, C, S, halogens, −OH, −CHO, and >CO and the 14 molecular descriptor items, including NumHeteroatoms, Chi0n, MaxPartialCharge, MinPartialCharge, SlogP_VSA12, SMR_VSA5, SMR_VSA10, HallKierAlpha, VSA_EState9, NumAromaticRings, NumHAcceptors, NumHDonors, RingCount, and NHOHCount, were selected for use in the next step (Table S5a,b). In the second step, we predicted the solubility associated with dD, dH, dP, SP, and Log P using a combination of analytical data, the selected seven molecular compositions, and the selected 14 molecular descriptor items as explanatory variables. In this prediction, we used the same breakdown of the dataset of test and training data as that of the compounds used in the 1-step DNN method. Overall, the values of R2 and RMSE were improved compared to those of the 1-step DNN method, although the values did no exhibit yet satisfactory accuracy except for dD, for which R2 was 0.75 (Figures , S4).

DNN Solubility Prediction 3-Step Method

As shown in previous studies, solubility predictions based on molecular descriptors have already been investigated.[20,23,24] In this study, the prediction with a 2-step DNN method based on analytical data and predicted values of molecular compositions and molecular descriptors as training was found to be superior than that with the 1-step DNN method using only the analytical data as training. However, the prediction accuracy was not adequate. Therefore, we opted for an alternative 3-step DNN solubility prediction method (Figure Method3 and Figure S1b, hereinafter called as the “3-step DNN method”). In the first step, we predicted the selected seven molecular composition items (described in the DNN solubility prediction 2-step method, Table S5a,b), including the number of H and C and the presence or absence of S, halogens, −OH, −CHO, and >CO, using these analytical data as training data. In the second step, we predicted the selected 14 molecular descriptor items (described in the DNN solubility prediction 2-step method, Table S5a,b) using a combination of analytical data and predicted molecular composition. In the third step, we predicted the solubility associated to dD, dH, dP, SP, and Log P with a combination of analytical data, predicted molecular composition, and 14 predicted RDKIT descriptor items. In this prediction, we used the same breakdown of the dataset of test and training data as that used in the 1-step DNN method. The results showed that the R2 values for dD, dH, dP, SP, and Log P were 0.81, 0.61, 0.61, 0.58, and 0.69, respectively, which were enhanced values for all items compared to those of the 2-step DNN method (Figures , S4). The results of R2 values for them with the random forest using the same 3-step method were 0.84, 0.53, 0.50, 0.54, and 0.61, respectively. In addition, the results of R2 values for them with XGBoost using the same 3-step method were 0.83, 0.40, 0.55, 0.53, and 0.64, respectively. Hence, in this study, these results with the DNN were mostly better than those of random forest and XGBoost algorithms. As same as R2, the results for the RMSE values improved for all items. In particular, the predicted dD, which indicates the dispersion energy, showed a relatively high accuracy. It was assumed that this was due to the use of the experimental refractive index value as training data, which is closely related to the weight per unit volume, density, and dD.[51] Actually, the refractive index is the most important factor in the case of the dD prediction (Figure , Table S6a). The dispersion energy dD is a weak intermolecular force that acts even for non-polar molecules, unlike the dipole moment dP. In general, larger molecules exhibit greater intermolecular forces. In other words, the greater the weight per unit volume, the stronger the intermolecular force. Therefore, it can be suggested that a strong relationship occurs between dD and the refractive index. Due to their importance for the dH prediction (Figure , Table S6b), the OH-, NH-, and H-related factors are at higher ranks. We believe that these results can be expected due to hydrogen bonding formation. In the case of the dP prediction (Figure , Table S6c), the partial charge, H, and number of heteroatoms are at higher ranks of importance. Since dP reflects the polarization rate, it can be assumed that the partial charge gives a large contribution to the dP prediction, and the lightest H atom and heteroatoms with unpaired electrons also have a great effect on the permanent dipole. As the accuracy of all R2 and RMSE obtained with the 3-step DNN method is higher than that of the 2-step DNN method and the values of all R2 are >0.5 (Figures , S4), it can be concluded that the solubility prediction of various substances using the 3-step DNN method based only on analytical data as input in the first step is effective. Although we prepared general compounds as a dataset, our models are built from a small dataset, and the prediction performance of our models is not high. Therefore, we re-checked the entire dataset tendency using the LOCO CV method[45] (see the “Confirming Exploration Performance for Dataset” in the Materials and Methods section). Specifically, in this test, we confirmed the availability of our dataset for each solubility prediction using the random forest with the cross-validation method using cluster data as the test data prepared with the k-means method. As a result, the prediction performance using our test data was stable for clusters, as a whole; however, in some cases, there were lower values than clustered data (Figure S5). In particular, it seems that the prediction performance of dH is comparatively low. We consider that it is better to use these models to understand solubility tendency. In contrast, we tried creating solubility prediction models with only molecular descriptors, which are the same 14 items of RDKIT’s descriptors in this study based on SMILES. The method used the same DNN described in the Materials and Methods section. The R2 of dD, dH, dP, SP, and Log P is 0.82, 0.88, 0.91, 0.85, and 0.94, respectively (Figure S6), the performance of which is higher than that of the 3-step DNN for all models. Of note, this approach has been already reported[23,24] and requires SMILES information. As the difference from our approach, which is prediction from analytical data, we consider that our models are more effective in the research stage such as without SMILES information. Moreover, in this study, creating prediction models step by step successfully increased the performance. This approach is similar to the intermediate supervision deep learning algorithm, which has been frequently used in the image-processing field in recent years.[52,53] Therefore, it is possible to adjust this method to our models. In addition, our stepwise method of DNNs in this study obtained models separately. Creating models with the all-in-one method, such as the abovementioned intermediate supervision deep learning, allows us to obtain an effective system and may improve model performance using interlocking models.

Figure 3

Importance of the solubility prediction. As the results of the determination of factor importance for certain attributes, the bar chart shows factors sorted by their importance ranking for each solubility. The checking calculations are performed using the random forest algorithm, which is the same program used in descriptor selection (see also the Materials and Methods section). The descriptors of NHOHCount, NumHAcceptors, NumHDonors, and NumHeteroatoms are the number of −NH and −OH, the number of hydrogen bond acceptors, the number of hydrogen bond donors, and the number of heteroatoms, respectively. The descriptors of Chi0n, MaxPartialCharge, MinPartialCharge, SlogP_VSA12, and SMR_VSA5 are the atomic valence connectivity index, maximum of molecular charge, minimum of molecular charge, MOE-type descriptor of Log P and surface area, and MOE-type descriptor of molar refractivity and surface area, respectively.

Application of the HSP Prediction Model to Polymer Data

The development of novel functional polymeric materials is an important research field that has been actively investigated from several viewpoints, such as the function, environment, and cost reduction. In recent years, a few reports have described solubility prediction approaches, such as machine learning methods using molecular structures, molecular descriptors, and so forth,[24] and molecular dynamics simulations.[6,7] On the other hand, our prediction model differs from other approaches as it exploits only four pieces of analytical data as input, that is, density, refractive index, and top values of the peaks in the 1D 1H NMR and 1D 13C NMR spectra. Therefore, it can also predict the solubility parameters from polymer data if these four pieces of analytical data and solubility values are available as inputs and objective variables, respectively. Therefore, we decided to apply our prediction model to polymer data. In this study, we decided to employ only the previously developed HSP (dD, dH, and dP) models as HSP parameters are the most commonly used factors to test the solubility of substances. We prepared a dataset of 23 common polymers including seven classes for testing, the details of which are mentioned in the Materials and Methods section. Upon predicting dD, dH, and dP, R2 was found to be 0.34, 0.45, and 0.38, respectively (Figure , Table S7). The result obtained for dH was better than that of dD and dP. It was suggested that dH well reflected the chemical shift since the molecular composition and functional group features for dH were comparatively more important factors than for dD and dP (Figure , Table S6). In conclusion, the application of our prediction model to polymers is overall less accurate than for low-molecular weight compounds; however, we believe that it can offer a good estimate of solubility.

Figure 4

Scatter plots of the HSP literature and prediction values for various polymers. Application of our HSP solubility prediction models to 23 common polymers. Here, seven polymer classes are shown using different colors.

Development of a Web tool and Potential Applications

In order to allow for an effective use of our prediction models, we developed a freely accessible MI web tool (http://dmar.riken.jp/matsolca/) using the abovementioned regression models, which provides the calculated values of HSPs, SP, and Log P as solubility information and the calculated substances with approximate HSPs, SP, and analytical data as solubility-related information. In general, the closer the HSP, SP, and analytical data information among two substances is, the easier they are to dissolve. Therefore, this tool provides not only the solubility prediction values, but also three pieces of additional solubility-related information: (1) the nearest HSPs (Figure a), which is the information of the substances with literature HSPs close to the predicted HSPs using the HSP distance,[54] (2) the nearest SP (Figure b), which is the information of the substances with a theoretical SP close to the predicted SP using the SP distance that is the absolute value of the difference between two SP values, and (3) similar analytical data (Figure c), which is the information of the substances with a similar fingerprint of analytical data between their own database and the user’s input data using the Hamming distance.[55]

Figure 5

Positive cycle of solubility prediction in the materials science industry linked by the MI tool. A positive cycle in the materials science industry linked by an effective MI tool is performed as follows: (1) Due to the development of materials, measurement information of chemical substances is accumulated. (2) The accumulated measurement information is utilized for creating prediction models of chemical properties. (3) MI tools are created using the prediction models. (4) Using the prediction models or MI tools for the development of new materials and technologies, predictive technology is growing. Eventually, the development of predictive technology leads to the effective development of new materials and technologies and further accumulation of measurement information. (a) Visualization function of the “nearest HSP search” on the web tool. The orange circle is a predicted location with HSPs. Other blue symbols are literature locations with HSPs. Evaluation of the solubility among two substances uses the HSP distance. (b) Visualization function of the “nearest SP search” on the web tool. The orange circle is a predicted SP value. Other blue symbols are theoretical SP values. (c) Function of similar analytical data search on the web tool. First, the fingerprints of the analytical data of the user’s input data (top) and literature data (database) are prepared. Second, the Hamming distances as evaluation of affinity among two substances are calculated. A substance having a low Hamming distance against the user’s input data can be dissolved with a substance of the input data. Herein, we wish to discuss the versatility of this method since the solubility application range is wide. In this study, we succeeded in predicting the solubility features using only analytical information as input data. As for the process, it was not possible to obtain sufficient accuracy using only the analytical data as training data. However, the accuracy was improved using a 3-step DNN method, which utilizes selected and predicted molecular compositions and molecular descriptors in phase as intermediate data for training. Furthermore, we tried to apply this MI tool based on the HSP prediction models to polymer data. By judging from the R2 and scatter plots, the results did not show high accuracy, but a good correlation occurred between literature and prediction values (Figure ). Therefore, the use of low-molecular weight compounds as training data was sufficient to determine the tendency of solubility of polymers. Furthermore, we created an efficient and user-friendly MI web tool as a solubility calculator based on our prediction models for users of several fields including industry dealing with solubility-related studies (Figure ). Commonly, Log P is used as a hydrophobicity index for determining the solvent selection, bioaccumulation, and biodegradability,[56−58] while HSPs and SP are used for applications based on the solubility of two substances, such as solvent selection/combination, coating techniques, polymer research, and drug development.[49,59,60] Notably, although HSPs are convenient indices for establishing the degree of solubility between two components, the components can be used even in mixtures. For example, a study revealed that the solubility between an insecticide’s solvent as a single component and a cockroach’s body surface as a mixture could be evaluated according to the HSPs.[4] Thus, these solubility-related values are widely applicable. In addition, it can be expected that this solubility prediction tool will be used in the biorefinery area, such as biomass recycling, processing, and molding, and in the blue carbon field, including research and development of sea sediments composed of microalgae and seaweeds as a source of CO2 absorption.[61−63] These land-based and water-based biomasses such as polysaccharides and lignin polymers are generally of low solubility;[64−66] therefore, a solubility prediction approach is useful to extend the industrial application in biorefinery processes. Recently, solubility predictions with several machine learning methods were developed and used.[17,23,24,67] However, in comparison with these predictions, our prediction models of solubility have an application advantage since they feature only analytical information as input data. Therefore, it can be expected that our models will find further application in several research and development fields where the solubility of compounds is important. In recent years, the accuracy of the NMR analysis and simplification of related measurements have been improved;[68−70] therefore, it can also be expected that more simple measurements will contribute to the prediction of physical properties such as solubility parameters. Furthermore, it is also anticipated that the creation of an efficient MI tool may lead to the sustainable development of the materials science industry via a positive cycle (ecosystem) including the accumulation of measurement data of chemical substances, utilization of the data for creating an MI tool of chemical properties, and development of materials science with the data and then again accumulation of measurement data.

3 in total

1. Prediction and Planning of Sports Competition Based on Deep Neural Network.

Authors: Jin Xu
Journal: Comput Intell Neurosci Date: 2022-06-08

2. Materials informatics approach using domain modelling for exploring structure-property relationships of polymers.

Authors: Koki Hara; Shunji Yamada; Atsushi Kurotani; Eisuke Chikayama; Jun Kikuchi
Journal: Sci Rep Date: 2022-06-22 Impact factor: 4.996

Review 3. The exposome paradigm to predict environmental health in terms of systemic homeostasis and resource balance based on NMR data science.

Authors: Jun Kikuchi; Shunji Yamada
Journal: RSC Adv Date: 2021-09-13 Impact factor: 4.036

3 in total