Sumin Lee1, Myeonghun Lee1, Ki-Won Gyak2, Sung Dug Kim2, Mi-Jeong Kim2, Kyoungmin Min1. 1. Department of Industrial and Information Systems Engineering, School of Systems Biomedical Science, School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea. 2. Polymer Research Lab, Samsung Advanced Institute of Technology, 130 Samsung-ro, Suwon, Gyeonggi-do 16678, Republic of Korea.
Abstract
Predicting both accurate and reliable solubility values has long been a crucial but challenging task. In this work, surrogate model-based methods were developed to accurately predict the solubility of two molecules (solute and solvent) through machine learning and deep learning. The current study employed two methods: (1) converting molecules into molecular fingerprints and adding optimal physicochemical properties as descriptors and (2) using graph convolutional network (GCN) models to convert molecules into a graph representation and deal with prediction tasks. Then, two prediction tasks were conducted with each method: (1) the solubility value (regression) and (2) the solubility class (classification). The fingerprint-based method clearly demonstrates that high performance is possible by adding simple but significant physicochemical descriptors to molecular fingerprints, while the GCN method shows that it is possible to predict various properties of chemical compounds with relatively simplified features from the graph representation. The developed methodologies provide a comprehensive understanding of constructing a proper model for predicting solubility and can be employed to find suitable solutes and solvents.
Solubility, the molecular-level miscibility of substances arising from intermolecular interactions, is relevant to all areas of chemistry, geochemistry, pharmaceutical science, physics, and biochemistry. It is also of great importance in a variety
of applications, including environmental prediction, agrochemical
design, chemical process design, crystallization, autonomous robotic
synthesis using flow chemistry, development of lithographic materials
in the semiconductor industry, as well as drug design, biochemistry, early drug discovery, and protein–ligand binding in the biomedical
field.[1−7] Specifically, aqueous solubility is essential in many industrial,
pharmaceutical, and environmental applications because it is critical
for the optimization of the initial process of finding drug candidates.
Because all drugs in the body show medicinal effects in the form of
aqueous solutions, a lower solubility reduces the medicinal effects
and bioavailability.[3]

While determining
whether a material can be dissolved is important,
solubility experiments, calculations, and predictions are not easy
tasks. First, the complex and multidimensional nature of the melting
process makes solubility prediction a difficult task.[4] Various factors, such as chemistry, thermodynamics, kinetics,
and morphology, need to be considered.[4,8] In addition,
it is almost impossible to collect all the data, which consists of
hundreds of thousands of solute–solvent combinations. Although
various statistical and machine learning approaches exist, they do
not reduce the complexity of the computations because each related
feature is either difficult to obtain or requires too many features.
Therefore, a simple, accurate, and reliable experimental data-driven
solubility prediction method will certainly be welcomed in various
fields of research in which solubility prediction is critical.

Historically, three approaches have been used to predict solubility:[9] (1) quantum mechanics, (2) the general solubility
equation (GSE),[10−14] and (3) machine learning. Quantum mechanical methods, such as the
ADF COSMO-RS program,[15,16] calculate the surface charge
using density functional theory to obtain the chemical potential,
which can be further used to compute thermodynamic equilibrium properties;
however, quantum mechanical methods are expensive, and quantitative
comparison with experimental results is somewhat difficult to achieve.
In addition, a mismatch between the calculated physical properties
and experimental data occurs when this method is utilized. The revised
GSE utilizes Jorgensen and Duffy’s computational method to
reduce the absolute error and make the process simpler and more accurate
than the original GSE. However, calculations must be performed for
all solvent–solute combinations for validation, which is inefficient
and requires considerable resources. As a solution, methods using
machine learning and statistics have recently emerged. For example,
the quantitative structure–activity and quantitative structure–property
relationship (QSAR/QSPR) is used as a general methodology, and various
data mining methodologies have been introduced in the field of solubility
prediction.[14,17,18] These methodologies have the advantage of providing efficient predictions
of solubility because they do not require complex calculations; that
is, they are computationally inexpensive. Furthermore, studies have
been conducted using deep learning frameworks that utilize neural
networks to build predictive models, particularly for aqueous solubility
prediction.[19,20] In addition, studies have been
conducted to consider a variety of solute–solvent pairs rather
than methods that make predictions with one fixed solvent.[4,21] However, methods in previous studies are limited in their ability
to predict solubility. They are unable to consider all solute–solvent
combinations because, for most studies, the solvent in the training
database is fixed. Furthermore, the models lack a comprehensive understanding
of appropriate features for representing the solubility and the molecules.
Each of these problems causes poor transferability and generality
of the prediction model, which requires the time-consuming development
of separate models for different systems.

In this study, two
data-driven methodologies for predicting the
solubility of solute–solvent sets are proposed: (1) the MFPCP
method, a machine learning method in which the molecular fingerprint
(MF) and physicochemical properties (PCP) are utilized as descriptors
and (2) the GCN method, in which molecules are converted into graphs
and a deep learning framework, a graph convolutional neural network
(GCN), is used as a prediction model. To construct the training database,
previously published solubility databases were merged and arranged.
The MFPCP method focuses on finding computationally inexpensive physicochemical
properties that provide the largest increase in the prediction accuracy
when used as descriptors. There are several methods to represent molecules.
For example, creating features using transformer[22] and using mathematical features[23] are also reasonable methods, but our study sought to use the most
fundamental, simple, and clear method above all else. Because the purpose is to predict solubility, this study used a molecular fingerprint, which reflects structural information in a simple form. Moreover, predictions for new molecular structures inevitably carry uncertainty, which is another reason a molecular fingerprint, directly reflecting the molecular structure, was chosen to represent the molecules.

While previous studies required numerous computationally expensive
features such as HOMO, LUMO, LsoluHsolv, and LsolvHsolu, the MFPCP
model only considers key physicochemical features that can be easily
obtained. Recent works on predicting solubility through machine learning
utilized molecular fingerprints in their work.[20,24−26] These studies applied machine learning/deep learning (ML/DL) methods, including random forest (RF), support vector regression (SVR), LightGBM, and LASSO,[20,24,26] Naïve Bayes-based models,[25] and deep learning to predict solubility. Although the importance of solubility prediction has been emphasized and various studies on it have been reported, previous studies focused only on aqueous solubility.

In contrast
to the structural or physicochemical descriptors used
by the MFPCP model, the GCN methodology employs only a simplified
molecular-input line-entry system (SMILES) to make predictions; therefore,
no other descriptors are required. Moreover, GCNs can be improved
by applying various algorithms. For example, DGraphDTA[27] is a multi-input network for drug–target
affinity prediction, and the combination of a GCN[28] and graph attention networks (GATs)[29] further improves the model. Recently, the prediction of
aqueous solubility using a graph-based message passing network, directed
edge graph isomorphism network, and multilevel GCN has been reported.[30−32] However, in the studies mentioned above, the solubility is commonly predicted with an individual data set and a model for a single solvent, which is a limitation because sufficient data is required for each system. Therefore, in this study, a methodology for predicting the solubility of various solute–solvent combinations is proposed. We believe that the current work will contribute
to overcoming the limitations of previous work: the proposed methodologies deliver strong prediction performance for pretrained solvents and for systems that would otherwise require complex computations, and they predict solubility accurately in a simple and convenient way.
Methods
A flowchart for solubility
prediction is shown in Figure 1. It briefly describes the
database construction, types of descriptors, and surrogate models
for regression and classification. The method to verify the performance
of the models in this study is as follows: (1) The data were split into training and test sets at an 8:2 ratio, and this was repeated 99 times with different configurations (random states); each performance metric is reported as the average over all states together with the maximum and minimum. (2) The size of the test set was increased incrementally
by 10% from 20% to 80%. More detailed information is provided below.
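The repeated-random-split protocol of step (1) can be sketched in a few lines of plain Python. The function names and the toy perfect predictor below are illustrative, not the paper's actual code:

```python
import random
from statistics import mean

def r2_score(y_true, y_pred):
    """Coefficient of determination, computed directly from its definition."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def repeated_split_eval(X, y, fit_predict, n_states=99, test_frac=0.2):
    """Shuffle with each random state, hold out `test_frac` of the data,
    and report the mean, min, and max R^2 over all states."""
    scores = []
    for state in range(n_states):
        idx = list(range(len(X)))
        random.Random(state).shuffle(idx)
        n_test = int(len(idx) * test_frac)
        test, train = idx[:n_test], idx[n_test:]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        scores.append(r2_score([y[i] for i in test], preds))
    return mean(scores), min(scores), max(scores)
```

A model wrapper with a `fit_predict(X_train, y_train, X_test)` signature (an assumption made here for brevity) lets the same loop evaluate either prediction model.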
Figure 1
Flowchart
of the whole process of solubility prediction. Two methods
(Boosting Model, GCN) are used to predict the solubility or to predict
whether the solvent material can well dissolve the solute material
by three classes (good/fair/bad). Network architectures of GCN are
introduced in the Methods. Directions and
flow of the data construction process including obtaining property
values is mentioned in the Supporting Information (Figure S2).
Solubility Data Sets
The solubility
database used in
this work integrated several individual data sets from different sources
previously used in other studies.[13,21,33−37] A list of these databases is provided in Table S1. Initially, the eight data sets had a total of 38992 data
entries. Then these data were prioritized according to the number
of solute–solvent combinations (because both methodologies
proposed in this study use solute–solvent combination data),
and deduplication was performed using the IUPAC International Chemical
Identifier (InChIKey). Finally, from the initial 38992 data entries,
a total of 17536 data entries were arranged, which contained 12849
solutes and 179 solvents, and were expressed using SMILES and InChIKey
(Table S1). All molecules represented in
SMILES in this data were converted to molecular fingerprints and graphs
using the RDKit[38] and PyTorch Geometric
(PyG)[39] packages, respectively. Figure S1 shows the distribution of the solubility
(logS) values of each data set and the number of solutes and solvents.
The most recurrent solutes were anthracene (0.55%), pyrene (0.44%),
and acenaphthene (0.33%), while the most recurrent solvents were water
(67.44%), ethanol (4.67%), and benzene (2.80%). The combined percentage
of the top five solutes was less than 1%, but the combined percentage
of the top five solvents, including water (11827), accounted for approximately
75% of the data set.

In addition to predicting solubility values
(regression), solubility values were divided into three classes, and
classification prediction was performed. Classification was performed
for the following reasons. If the purpose is to determine whether
substance A will dissolve well in substance B based on certain threshold
values, predicting solubility through classification can be more intuitive
for producing and analyzing results. Furthermore, predicting solubility
through classification should work well because there was an imbalanced
combination of solvents and solutes in the data set. The logS value
was divided into three classes (low, medium, and high) based on the
threshold criteria.[38]

The distribution of the target value (logS) and the class distribution of the data set are shown in Figure 2 and Figure S2. They show that the logS values were distributed mostly
between −2.5 and 0.0 mol/L. In addition, because the class
distribution was well balanced, the training process was facilitated.
Furthermore, the boxplots in Figure S2 show
the logS distribution of data01, data02, and data03, which indicates that the logS values
had almost equal distributions in all three data sets. This is important
because an imbalanced logS distribution in one of the data sets can
lead to biased prediction results.
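For illustration, the three-way binning described above reduces to a simple threshold function. The exact cutoff values come from the cited reference and are not reproduced here, so the thresholds below are assumed example values only:

```python
# The paper bins logS into low / medium / high classes; the cutoffs
# below are hypothetical placeholders, not the paper's actual criteria.
LOW_CUT, HIGH_CUT = -4.0, -2.0  # assumed example values in log(mol/L)

def solubility_class(logS):
    """Map a logS value to one of three solubility classes."""
    if logS < LOW_CUT:
        return "low"
    if logS < HIGH_CUT:
        return "medium"
    return "high"
```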
Figure 2
Data Identification: (a) logS distribution
and (b) class distribution
of the data set we used. (c) Table showing the number of solutes and
solvents of each data and the total number of data. The table shows
that the data size of data03 is small compared to
the other two data sets due to data loss in the process of obtaining
property values. Nevertheless, we use data03 since
it contains all of the property values (MP, MW, volume).
For the MFPCP method, physicochemical properties were added
to
the original data set, in addition to the molecular fingerprints.
Because data loss occurred during property acquisition, three variations
of the original data set were considered, which depended on the availability
of each property. Finally, three databases each of which had different
features of the prediction model were utilized. The data construction
process is detailed in Figure S3, and the
variations are named data01, data02, and data03. The goal was to obtain three property
values: molecular weight (MW), volume, and melting point (MP).

First, data01 was equal to the original data set,
in which no physicochemical properties were added (i.e., only molecular
fingerprints were used as descriptors). Second, data02 was the data after obtaining MW and volume.[40] Using RDKit, the 60 entries for which either property could not be calculated were removed, so data01 and data02 contain the same entries. Finally, data03 included all three properties along with MP. Data
loss from obtaining MP values was considerably higher than that from
obtaining the volume and MW values. Each property value and its Pearson’s
correlation coefficient are shown in Figure S4. The R values of MW, volume, and MP according to
logS were −0.34, −0.38, and −0.09, respectively.
In particular, MP did not appear to correlate with logS. The final
three databases utilized each had different features for the prediction
model (data01: MACCS fingerprint; data02: MACCS fingerprint, MW, volume; data03: MACCS fingerprint,
MW, volume, MP).
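Assembling the three descriptor variants can be sketched as follows. The dictionary layout and field names are illustrative assumptions, but the drop-on-missing behavior mirrors how data03 shrinks relative to data01 and data02:

```python
def build_descriptor_rows(entries, properties=("MW", "volume", "MP")):
    """Append the requested physicochemical properties to each MACCS
    fingerprint, dropping entries with any missing property (the data
    loss that shrinks data03). Entry layout is illustrative only."""
    rows = []
    for e in entries:
        vals = [e["props"].get(p) for p in properties]
        if any(v is None for v in vals):
            continue  # property could not be obtained for this entry
        rows.append(list(e["maccs"]) + vals)
    return rows
```

Calling it with `properties=()` reproduces data01 (fingerprints only), `("MW", "volume")` reproduces data02, and the default reproduces data03.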
MFPCP Method
In this method, molecular
fingerprints
and physicochemical properties were used as the descriptors. The most
common way of representing molecules is the SMILES format. In this
work, the chemical structures of the molecules were converted into
forms that the computer could decode through a molecular fingerprinting
methodology. Converting a molecule into a molecular fingerprint can
be performed using various methods, but a 166-dimensional structure
called Molecular ACCess System (MACCS) keys[41,43,44] was used in this study. The MACCS fingerprints
can be readily obtained using RDKit and are directly converted from
SMILES. The obtained fingerprints of the solutes and solvents were
used as the descriptors in the predictive model.

Among the various
data-driven machine learning methodologies, LightGBM[45] was used. It was chosen because, as a boosting algorithm,
it is capable of minimizing prediction error loss and reducing computation
time. Because the scale of the three PCP features was different from
that of the MACCS fingerprint, a standard scaler was implemented using
scikit-learn.[46] GridSearchCV was used for
hyperparameter optimization. In particular, the complexity of the
model was reduced by adjusting min_child_samples, max_depth, and num_leaves.
More details, including the parameter values and code, are included
in the Supporting Information.

The
fingerprinting methodology was applied to the reference data
set, which was obtained from a previous study,[21] and the results from using different combinations of features
as descriptors were compared (Table S2).
Three physicochemical properties (MP, MW, and volume) were proven
to be the most important features (Figure 3). Hence, they were selected as additional
descriptors, and the selected properties were then added to the data
set used in this study.
Figure 3
Feature importance plot for (a) LightGBM regression
models and
(b) LightGBM classification models. Volume, MW, and MP seem to have
high feature importance in both regression and classification tasks.
(c) Description and dimension of the descriptors used in MFPCP method.
The following explains how each property was obtained.
The MW and
volume values were obtained from RDKit using SMILES.[40] Unlike these, the MP value was not easily obtainable through
a Python package; therefore, a web crawler was used to scrape data
directly from Chemspider,[42] a chemical
database site. Units were unified to Celsius, and those without MP
data were removed in data03.

Despite the considerable
amount of data that was removed, data03 was used
because the MP has a decisive impact on predicting solubility, as shown in Figure 3. However, because roughly 60% of the data was omitted
compared to data01 and data02, all
three data sets were preserved and used in the prediction process
to clearly compare performance differences due to the descriptors
used. Various validations have been conducted to demonstrate that
the physicochemical properties are effective in improving the performance.
Molecular Graph Representation
The idea of molecular
graph representation lies in graph theory. When chemical compounds
are expressed as graphs, atoms correspond to nodes, and the bonds
between them correspond to edges without weight. A graph is represented
as G = (X, I), where X is a node feature matrix and I is an edge index
matrix.[47] The one-hot encoded node feature
matrix (N × 75, where N is
the number of atoms) consisted of 75 dimensions from the eight features
of atoms (Table 1).
Matrix I was transformed into matrix A for graph convolution. A is an adjacency matrix, where Aij = 1 if the ith and jth atoms have a bond and Aij = 0 otherwise;
that is, an N × N matrix, where N is the number of nodes. The SMILES information from the
database was transformed into molecular graphs using the RDKit and
PyG packages.
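The conversion of the edge index matrix I into the adjacency matrix A amounts to the following dependency-free sketch (PyG handles this internally):

```python
def edge_index_to_adjacency(edge_index, n_atoms):
    """Convert an edge-index list [(i, j), ...] into an N x N adjacency
    matrix with A[i][j] = 1 when atoms i and j share a bond."""
    A = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in edge_index:
        A[i][j] = 1
        A[j][i] = 1  # chemical bonds are undirected
    return A
```

For a three-atom chain bonded 0–1–2, `edge_index_to_adjacency([(0, 1), (1, 2)], 3)` yields the expected symmetric matrix with zeros on the diagonal.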
Table 1
Description and Dimension of the Descriptors
Considered in the Feature Matrix of GCN
feature              description                                       dimension
symbol               types of atoms                                    44
degree               degree of the atom in the molecule                11
implicit H's         number of implicit H's on the atom                7
formal charge        formal charge for the molecule                    1
radical electrons    number of radical electrons                       1
hybridization        sp, sp2, sp3, sp3d, or sp3d2                      5
aromatic             whether the atom is part of an aromatic system    1
total no. of H's     number of bonded hydrogen atoms                   5
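The one-hot node featurization summarized above can be sketched as follows. The 44-symbol vocabulary is an assumption (a commonly used DeepChem-style list, not necessarily the paper's); the per-feature dimensions follow the table and sum to 75:

```python
# Assumed 44-symbol vocabulary (DeepChem-style); the paper's exact list
# is not given here.
ATOM_SYMBOLS = ["C", "N", "O", "S", "F", "Si", "P", "Cl", "Br", "Mg",
                "Na", "Ca", "Fe", "As", "Al", "I", "B", "V", "K", "Tl",
                "Yb", "Sb", "Sn", "Ag", "Pd", "Co", "Se", "Ti", "Zn", "H",
                "Li", "Ge", "Cu", "Au", "Ni", "Cd", "In", "Mn", "Zr", "Cr",
                "Pt", "Hg", "Pb", "Unknown"]
HYBRIDIZATIONS = ["sp", "sp2", "sp3", "sp3d", "sp3d2"]

def one_hot(value, choices):
    """One-hot encode `value` over `choices`; unknown values map to zeros."""
    return [1 if value == c else 0 for c in choices]

def atom_feature_vector(atom):
    """Build the 75-dim node-feature row from the eight tabulated features
    (44+11+7+1+1+5+1+5 = 75). `atom` is a plain dict for illustration;
    in practice the values come from RDKit atom objects."""
    v = []
    v += one_hot(atom["symbol"], ATOM_SYMBOLS)            # 44
    v += one_hot(atom["degree"], list(range(11)))         # 11
    v += one_hot(atom["implicit_h"], list(range(7)))      # 7
    v += [atom["formal_charge"]]                          # 1
    v += [atom["radical_electrons"]]                      # 1
    v += one_hot(atom["hybridization"], HYBRIDIZATIONS)   # 5
    v += [1 if atom["aromatic"] else 0]                   # 1
    v += one_hot(atom["total_h"], list(range(5)))         # 5
    return v
```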
Graph Convolutional Network
Graph
neural networks (GNNs)
have shown outstanding performance in various fields[48] because there are no limitations on the size of graphs
(consisting of nodes and edges) used as inputs to GNNs; hence, they
provide a flexible format for extracting in-depth information from
molecules.[49] In chemistry, molecules are
modeled as graphs, and their properties need to be verified for drug
discovery.[49] GCNs[28] and GATs[29] are widely used GNN models
that have gradually been applied for purposes such as drug property
prediction and molecular fingerprint generation,[27] thus presenting a new paradigm for virtual screening[50] that accelerates drug discovery with improved
prediction accuracy.[51] Furthermore, protein
interface prediction using graph convolution suggests the promising
potential of GCNs in drug development.[52]

Solubility predictions require the conversion of solutes and
solvents into graphs to obtain logS values or classes. Therefore,
the multi-input GCN model in this study was composed of three types
of layers: (1) graph convolution layers, (2) a merge layer, and (3)
prediction layers. In the graph convolution layers, the features of
each atom were updated by their neighborhoods. The detailed process
flow, including the molecule-to-graph conversion in the GCN and construction
of model structures, is shown in Figure 4. The graph convolution algorithms in this
study were implemented using PyG (Figure S5) and GCNConv;[28] detailed formulas are
described in the Supporting Information. After the graphs of the solutes and solvents passed through the
convolution layers, the activations were concatenated in the merge
layer. The merged activations then passed through the fully connected
layers for prediction. The output of the GCN model was either a value
of logS (regression) or the probability of each solubility class (classification).
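The neighborhood update performed in the graph convolution layers can be illustrated with a minimal, dependency-free sketch: mean aggregation over neighbors with a self-loop, followed by a linear map and ReLU. This conveys the idea only; the actual GCNConv layer uses symmetric degree normalization and learned weights:

```python
def gcn_layer(X, A, W):
    """One simplified graph-convolution step. X is the N x d_in node
    feature matrix, A the adjacency matrix, W a d_in x d_out weight
    matrix. Each atom averages its own and its neighbors' features,
    then applies W and ReLU."""
    n, d_in, d_out = len(X), len(X[0]), len(W[0])
    H = [[0.0] * d_out for _ in range(n)]
    for i in range(n):
        neigh = [j for j in range(n) if A[i][j] == 1 or j == i]  # self-loop
        agg = [sum(X[j][k] for j in neigh) / len(neigh) for k in range(d_in)]
        for o in range(d_out):
            H[i][o] = max(0.0, sum(agg[k] * W[k][o] for k in range(d_in)))
    return H
```

Stacking several such layers, as the model does with four GCNConv layers, lets information propagate across increasingly distant atoms.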
Figure 4
Graph
convolutional network process flow. (a) Schematic description
of converting the molecule to graph. The process is done by importing
the features of the neighboring element combined with the element
into the central element. (b) Overview of the multi-input network
model architecture. The solute and the solvent are converted into
a SMILES representation, passed into the graph convolution layers,
and then merged in the fully connected layer. Finally, we can obtain
the prediction results. The activation function of the output layer
differs according to the given task.
The hyperparameters of the GCN model were tuned to achieve better
performance while checking for overfitting (Table S3 and Figure S6). The graph convolution
layers consisted of four GCNConv layers, each followed by the ReLU
activation function, batch normalization, a global pooling layer,
and a dropout layer with a rate of 0.1. The prediction layers consisted
of four fully connected layers, each similarly followed by the ReLU
activation function, batch normalization, and a dropout layer with
a rate of 0.1. The dimensions of each layer of the prediction models
were 64 (regression) and 512 (classification). The batch size and
number of epochs were 256, 200 (regression) and 256, 100 (classification).
The Adam optimizer was used with learning rates of 0.005 (regression)
and 0.0005 (classification). For classification, the last layer of
the prediction layer contained the log softmax function.
Results
and Discussion
In this study, regression and classification
were employed to measure
the performance of both prediction models: MFPCP and GCN. The performance
was evaluated by training each model with a training set and evaluating
the model with a test set, which were created by splitting the data
set at an 8:2 ratio. Figures 5 and 6 show the results obtained using data02, but all three data sets provided meaningful insights.
Therefore, the performances of data01, data02, and data03 are evaluated.
Figure 5
MFPCP results: (a) regression
and (b) classification prediction
results using the MFPCP method. The results shown are the results
of the prediction using data02. Results tested on
all three data sets are available in the Supporting Information.
Figure 6
GCN results: (a) regression
and (b) classification prediction results
using GCN. The results below are the results of the prediction using data01. Results tested on all three data sets are available
in the Supporting Information.
A detailed description of the metrics used to evaluate the
results
is provided in the Supporting Information. By using various metrics for each task with three different data
sets, each prediction model can be evaluated from diverse perspectives,
confirming the performance of the considered approaches.
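For reference, the micro and macro F1 scores reported in the following sections can be computed as below, a plain-Python sketch of the standard definitions (in practice, sklearn.metrics.f1_score with average='micro' or 'macro' does the same):

```python
def f1_scores(y_true, y_pred, classes=("low", "medium", "high")):
    """Micro- and macro-averaged F1 for a multiclass task, computed from
    per-class true/false positives and false negatives."""
    per_class = []
    tp_all = fp_all = fn_all = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    macro = sum(per_class) / len(classes)  # unweighted mean over classes
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all)  # global counts
    return micro, macro
```

Macro-averaging weights every class equally, which is why it complements the micro score on a class-imbalanced data set like this one.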
Performance
of the MFPCP Method
The prediction results
obtained by using the MFPCP methodology on data02 are shown in Figure 5. As mentioned in the Methods, three PCPs
(MP, MW, and volume) were chosen from among the various properties
utilized in the reference study,[21] focusing
on those that did not require complex computations and were convenient
to obtain. data02 retained most of the database after
including two PCPs; hence, it was chosen as the principal data set.
An average R2 value of 0.80 (±0.02)
was obtained with mean absolute error (MAE) and root-mean-square error
(RMSE) values of 0.65 (±0.03) and 0.96 (±0.05) mol/L, respectively.
While a reasonable performance score was achieved with data02, the best score was achieved using all the properties (MACCS fingerprint,
volume, MW, and MP). The R2 value for data03 was 0.85 (±0.03) when the training to test set
ratio was 8:2 and 99 different configurations (random selection) of
the training set were used. Although data01 contained
the largest number of data entries, it did not have the highest prediction
accuracy (R2 = 0.74 (±0.03)); this
validates the importance of including meaningful features. The true
vs predicted logS plots for all the data sets are shown in Figure 5a and Figure S7a. Interestingly, although the number
of data entries was drastically reduced from 17536 to 6945 in the
process of obtaining all three PCP values, the performance score of
the regression task increased from 0.75 to 0.853 (average R2). This demonstrates that better prediction
performance can be achieved with smaller amounts of data when significant
features, such as volume, MW, and MP, are introduced.

For the
classification task with data02, the receiver operating
characteristic (ROC) curve with the area under the curve (ROC-AUC)
and the confusion matrix are shown in Figure 5b. The micro and macro F1 scores were used
to evaluate the results of the multiclass classification task more
precisely. The accuracy of 77.70% (±1.58%) with a micro F1 value
of 0.78 (±0.02) confirmed the high prediction accuracy. It is
interesting to note that, as previously observed in the regression
task, the best accuracy of 84.16% (±2.56%) and F1 value of 0.84
(±0.03) were achieved using data03, as shown
in Figure S8. This means that, as previously
discussed for the regression task, including all three physicochemical
properties (MP, MW, and volume) increased the prediction capability
despite the reduced size of the database. For reference, the results
of all three data sets are shown in the Supporting Information; the regression results are shown in Figure S7, and the classification results are
shown in Figure S8. Additionally, Figure S9 shows the ROC curves drawn using all
three data sets.

In practice, the training data set is small,
while the test data
set is large. Therefore, it is important to verify whether the current
model is applicable over a wide chemical space by determining how
the prediction accuracy is affected by reducing the size of the training
set. More specifically, the model was previously trained with 80%
of the total data; therefore, the size of the training set was drastically
reduced to 20%, and the size of the test sets was increased to 80%
to ensure that the model performed sufficiently on a smaller amount
of data. The performance variation from changing the portion of the
test set from 20% to 80% (data02) is shown in Figure S10. The results clearly validate that
the current methodology mostly preserves its prediction accuracy when
the portion of the training set is greatly reduced. Specifically,
the R2 value was only slightly reduced to 0.75 (±0.01) (despite a 75% smaller training set) when only 20% of data02 was
used for training (R2 was 0.8 when the training set was
80% of the data set). This indicates that the chemical space to be
explored can be widened without losing the prediction performance.
It has also been shown that adding new physicochemical properties,
such as the ones given in this work, results in higher performance
when predicting solubility and that it is worth obtaining new property
values at the expense of losing some parts of the database. From Figure S11, it can be seen that the MFPCP model
exhibited almost no variation with the training or test data sets.
This means that this model is sufficiently stable to be used in various
cases.
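The split-ratio experiments described above (8:2 down to 2:8, scored by R2) can be sketched in a minimal, self-contained form. The helper names below are hypothetical illustrations of the protocol, not the authors' code:

```python
import random

def split_indices(n_samples, train_fraction, seed=0):
    """Shuffle and split sample indices at the given training fraction,
    as in the 8:2 vs 2:8 validation described in the text."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

def r2_score(y_true, y_pred):
    """Coefficient of determination used to compare the splits."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# 2:8 stress test: train on only 20% of the pool, predict the other 80%.
train, test = split_indices(1000, 0.2)
print(len(train), len(test))                        # 200 800
print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # 1.0
```

Repeating such splits with different seeds (as in Figure S13) gives the spread of R2 values used to judge model stability.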
Performance of the GCN Architecture
Since data01 and data02 are the same, when data01 was used for training, the GCN model had an R2 value of 0.80 (±0.03), MAE of 0.61 (±0.03) mol/L,
and RMSE of 0.96 (±0.06) mol/L for regression and an average
accuracy of 78.86% (±1.78%) for classification (Figure ). For classification, the
prediction accuracy for the high class (high solubility) was the highest
at 81.96% (±2.78%), while the prediction accuracies for the low
and medium classes were lower at 81.86% (±4.42%) and
73.57% (±3.77%), respectively. This demonstrates that the current
model can be employed to predict any class. Looking at the performance
differences between data sets, R2 and the average accuracy
for data03 were 0.81 (±0.04) and 80.4% (±3.60%),
respectively (Figures S8 and S9). Therefore,
consistent prediction performance was observed for both regression
and classification, regardless of the data set. In addition, only
the amount of data in the data set is relevant for the GCN model because
it does not require any PCP values.
Compared to the results
of the MFPCP method, the R2 value of regression
(Figure S7) for the GCN model (0.80) was
0.06 higher than that for the MFPCP model (0.74) using data01 (using only fingerprints). In contrast, using data03 with less data, the R2 value for the
MFPCP model (0.85) was 0.04 higher than that for the GCN model (0.81).
Similarly, the classification accuracy (Figure S8) for the GCN model (78.86%) was 4.49% higher
than that for the MFPCP model (74.37%) using data01. In contrast, using data03, the classification
accuracy for the MFPCP model (84.16%) was 3.76% higher than that for
the GCN model (80.40%). This suggests that the GCN model using only
SMILES is more advantageous if there are more solubility data, while
the MFPCP model is more advantageous when there is less data, if MFPCP
data are available. Even if the data set is small, the MFPCP model
can be highly accurate if the appropriate features are obtained. Therefore,
the two models built in this study can each be the best model for solubility prediction in a given environment, provided that they are appropriately chosen according to the scale of the database and the number of features available.
To evaluate the stability of the model, additional validation was
conducted by varying the ratio of the training to test set from 8:2
to 2:8, as shown in Figure S12. When data01 was divided at a ratio of 2:8 (training to test set),
the R2 value was 0.72 (±0.02), and
the average accuracy was 72.96% (±1.16%). Both the R2 value and the accuracy were reduced relative to those
obtained using the 8:2 ratio (reductions of 0.08 and 5.90%, respectively).
This demonstrates the advantage of the GCN model, which exhibits great
potential for predicting vast amounts of unknown data by training
with a relatively small database. Furthermore, another validation
was conducted by randomly splitting the database (8:2 training to
test set ratio) with different random states, as shown in Figure S13. The R2 value remained between 0.77 and 0.83, and the average accuracy was
between 77.02% and 80.59%. This confirms the stability of the current
GCN model.
Many solubility prediction models focus only on a single solvent type. In contrast, the concatenated model, which combines the graphs
for the solute and solvent, can predict the solubility of a solute
in a target solvent with a small portion of its solubility data set.
Moreover, the GCN model has the advantage that feature generation
is relatively intuitive and does not require additional computation.
Because converting chemical compounds into graphs only requires atom
and structural information (such as SMILES), the complex process of
obtaining experimental or computational descriptors is not necessary.
Therefore, if a large amount of solubility data is available, it can
be used immediately without additional calculations. As a result,
the constructed GCN model is capable of predicting the solubility
of any combination of solutes and solvents without additional descriptors.
This suggests that the multi-input GCN model has great potential in
predicting not only solubility but also many important properties
of two targets that can be expressed as graphs.
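The multi-input idea can be sketched in a minimal, dependency-free form. The following is an illustrative example only (toy atom features, a simplified mean-aggregation update, and hypothetical function names), not the authors' implementation:

```python
def gcn_layer(features, adjacency):
    """One graph-convolution step: each atom's new feature is the
    average of itself and its neighbors (a simplified GCN update)."""
    new_features = []
    for i, feat in enumerate(features):
        neighborhood = [features[j] for j in adjacency[i]] + [feat]
        dim = len(feat)
        new_features.append([
            sum(v[k] for v in neighborhood) / len(neighborhood)
            for k in range(dim)
        ])
    return new_features

def readout(features):
    """Mean-pool atom features into a single molecule-level vector."""
    dim = len(features[0])
    return [sum(f[k] for f in features) / len(features) for k in range(dim)]

def encode_pair(solute, solvent):
    """Encode solute and solvent graphs separately, then concatenate,
    as in a two-branch (multi-input) GCN."""
    vectors = []
    for feats, adj in (solute, solvent):
        vectors.extend(readout(gcn_layer(feats, adj)))
    return vectors

# Toy graphs: features are [is_carbon, is_oxygen]; adjacency lists index atoms.
ethanol_like = ([[1, 0], [1, 0], [0, 1]], {0: [1], 1: [0, 2], 2: [1]})
water_like = ([[0, 1]], {0: []})
pair_vector = encode_pair(ethanol_like, water_like)
print(len(pair_vector))  # 4 (two 2-dim molecule vectors concatenated)
```

Because each molecule is encoded independently before concatenation, any solute can be paired with any solvent at prediction time, which is what enables arbitrary solute-solvent combinations.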
Additional
Validation
In the constructed database, water is the solvent for 67.44% of the entries (Figure S1b). Hence, it could be argued that the ability of the two models to predict aqueous solubility predominantly determines the overall result. To verify this, the regression results were further examined by reconstructing data01, data02, and data03 with the water solvent excluded. Without water, the data sets contained 5709 entries (data01 and data02) and 3305 entries (data03). It is critical to note that the prediction accuracy of both methods remained high even without water: the MFPCP method showed R2 values of 0.675 (±0.02), 0.721 (±0.023), and 0.840 (±0.02) for the three data sets, while the GCN showed R2 values of 0.70 (±0.05) and 0.74 (±0.08) for data01/data02 and data03, respectively. The GCN showed stable performance regardless of the data set, whereas the prediction performance of the MFPCP method was highest with data03 because more features are used. In addition, to verify that the performance differences between data set configurations were caused by the reduced amount of data rather than by the removal of water, the results were also checked on water-inclusive data sets matched in size to the water-free ones. Here, the MFPCP method showed R2 values of 0.68 (±0.02), 0.787 (±0.02), and 0.842 (±0.02), and the GCN method showed values of 0.74 (±0.02) and 0.76 (±0.06), respectively, for each data set. These results are similar to those obtained without water, confirming that the strong performance of the current models is not merely due to water and that they remain applicable to various solute–solvent combinations.
In addition, most of the solubility prediction
results in prior studies were reported for solutes in only one specific solvent; thus, a direct performance comparison with the current study is difficult. Therefore, this study additionally compared model performance by modifying the multi-input architecture for the solute and solvent into a single-input architecture for the solute. Boobier et al.[21] reported results for predicting solubility in water, so we compared the performance of the single-input versions on the same data set. The prediction accuracy of the modified MFPCP model was R2 = 0.86 and RMSE = 0.82, and that of the GCN model with the modified architecture was R2 = 0.87 and RMSE = 0.86. This performance is slightly lower than the previously reported results (Boobier: R2 = 0.93, RMSE = 0.71). In addition to water, the prediction results for ethanol, benzene, and acetone are summarized in Table S4; overall, the models do not exhibit a significant performance difference. It is essential to note that the previous study used features that must first be generated through additional density functional theory calculations, an approach that requires significant calculation time and resources. Furthermore, they obtained results for each solvent separately, meaning that performance on multiple solute–solvent combinations was not examined. Unlike that work, our models require no additional calculations for either the MFPCP or GCN method, and solute–solvent variety is considered. To summarize, our models combine the advantage of inexpensive descriptors and the convenience of graph generation while retaining the ability to assess multi-input combinations.
Moreover, as a multi-input model,
it is expected that the models
of this study will be able to predict unknown intermolecular solubility
by learning the combinations of many different compounds. To further
verify this, an experiment was conducted with the GCN model on held-out solvents: for data01, each test set was composed of all entries for a single solvent with between 20 and 45 data points, and that solvent was entirely absent from the training set. In the resulting regression, 18 of the 29 such test sets showed R2 > 0.80 (Figure S14). This suggests that the solubility of unknown compound combinations can still be predicted with one of our models, in contrast to approaches in previous studies[20,25,26,30−32] in which individual models were trained with sufficient data for each solvent.
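The held-out-solvent protocol can be sketched as a grouped split. This is a simplified illustration with hypothetical record fields and thresholds, not the authors' data pipeline:

```python
from collections import Counter

def held_out_solvent_splits(records, min_count=20, max_count=45):
    """Yield (solvent, train, test) splits in which every record of the
    chosen solvent is held out for testing, mimicking evaluation of
    solubility prediction for solvents unseen during training.

    `records` is a list of dicts with at least a "solvent" key.
    """
    counts = Counter(r["solvent"] for r in records)
    eligible = [s for s, c in counts.items() if min_count <= c <= max_count]
    for solvent in sorted(eligible):
        test = [r for r in records if r["solvent"] == solvent]
        train = [r for r in records if r["solvent"] != solvent]
        yield solvent, train, test

# Toy data set: with min_count=3 and max_count=5, only "ethanol" (3 records)
# qualifies as a held-out solvent; "hexane" (2 records) is below the floor.
data = (
    [{"solvent": "ethanol", "solute": f"mol{i}"} for i in range(3)]
    + [{"solvent": "hexane", "solute": f"mol{i}"} for i in range(2)]
)
splits = list(held_out_solvent_splits(data, min_count=3, max_count=5))
print([s for s, _, _ in splits])  # ['ethanol']
```

A model that scores well on such splits is extrapolating to solvents it has never seen, rather than interpolating within a known solvent's data.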
Conclusion
In this study, we proposed two data-driven methodologies for solubility
prediction. Previous efforts have mainly focused on predicting solubility
in a single solvent or a small number of designated solvents. However,
because numerous fields (such as flow chemistry) require knowledge
of the relationships between various combinations of solutes and solvents,
we designed a model that is not affected by the number of unique solvents.
The experiment was conducted with a total of 179 solvents, and two
methods were designed: MFPCP and GCN.
The MFPCP method demonstrates
that adding physicochemical properties
to the features used for predicting solubility has several benefits,
including high performance. Simply using the molecular fingerprints
of the solute and solvent produced moderate performance, but the performance
improved when specific physicochemical properties were added to the
features. A definite advantage of this model is its low computational
cost because the process of obtaining the features is computationally
inexpensive. In addition, it can be utilized over a broad chemical
space because training with a small data set enables solubility predictions
for much larger unknown data sets.
Deep learning, particularly
GCNs, has attracted considerable attention
in the field of cheminformatics. The GCN model proposed in this work
considers the atom features and structures of various combinations
of solutes and solvents. Because the GCN only requires simple structural
descriptors, adding more data sets is relatively straightforward,
unlike conventional feature-based machine learning models, which often
require additional quantum mechanical calculations to improve their
prediction accuracy. Furthermore, the GCN model can be improved by
tuning the composition of its architecture and parameters. The GCN
model developed in this study is an effective approach for solubility
prediction and can assist in new material and drug development processes.
The methodologies proposed in this paper require only seconds or minutes of calculation time, making them faster and more effective than computational quantum mechanical modeling methods, and they can
consider a variety of combinations. In addition, the models are robust
in that they are almost unaffected by how the data set is split
and reliably show consistent performance. In summary, machine
learning can be an effective tool for predicting the solubility of
various combinations of solutes and solvents.