Myeonghun Lee1, Kyoungmin Min2. 1. School of Systems Biomedical Science, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea. 2. School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea.
Abstract
The prediction and evaluation of the biodegradability of molecules with computational methods are becoming increasingly important. Among the various methods, quantitative structure-activity relationship (QSAR) models have been demonstrated to predict the ready biodegradation of chemicals but have limited functionality owing to their complex implementation. In this study, we employ the graph convolutional network (GCN) method to overcome these issues. A biodegradability dataset from previous studies was trained to generate prediction models by (i) the QSAR models using the Mordred molecular descriptor calculator and MACCS molecular fingerprint and (ii) the GCN model using molecular graphs. The performance comparison of the methods confirms that the GCN model is more straightforward to implement and more stable; the specificity and sensitivity values are almost identical without specific descriptors or fingerprints. In addition, the performance of the models was further verified by randomly dividing the dataset into 100 different cases of training and test sets and by varying the test set ratio from 20 to 80%. The results of the current study clearly suggest the promise of the GCN model, which can be implemented straightforwardly and can replace conventional QSAR prediction models for various types and properties of molecules.
The prediction and evaluation of the biodegradability of molecules with computational methods are becoming increasingly important. Among the various methods, quantitative structure-activity relationship (QSAR) models have been demonstrated to predict the ready biodegradation of chemicals but have limited functionality owing to their complex implementation. In this study, we employ the graph convolutional network (GCN) method to overcome these issues. A biodegradability dataset from previous studies was trained to generate prediction models by (i) the QSAR models using the Mordred molecular descriptor calculator and MACCS molecular fingerprint and (ii) the GCN model using molecular graphs. The performance comparison of the methods confirms that the GCN model is more straightforward to implement and more stable; the specificity and sensitivity values are almost identical without specific descriptors or fingerprints. In addition, the performance of the models was further verified by randomly dividing the dataset into 100 different cases of training and test sets and by varying the test set ratio from 20 to 80%. The results of the current study clearly suggest the promise of the GCN model, which can be implemented straightforwardly and can replace conventional QSAR prediction models for various types and properties of molecules.
Microorganisms remove
organic materials from the environment through
oxidation, reduction, and hydrolysis. This process is defined as biodegradation
and is an essential method of removing pollutants from the environment.[1] However, the accumulation of non-biodegradable
chemicals poses a potential threat to humans and ecosystems. Some
synthetic plastics, such as polyethylene–starch blends and
polyester polyurethane, are biodegradable, but the most commonly used
plastics are non-biodegradable or require decades to degrade completely.[2] Therefore, it is essential to establish effective
evaluation methods for classifying the potential biodegradability
of organic materials.European legislators included chemical
persistency in Registration,
Evaluation, and Authorization of Chemicals (REACH) for the evaluation
of chemicals.[3] REACH requires biodegradability
assessments of chemicals produced or imported in quantities of more
than a ton per year.[4] However, only 61%
of chemicals produced or imported in quantities of more than 1000
tons per year have information on biodegradability.[5] Therefore, it is critical to assess the biodegradability
of the remaining chemicals. As a potential solution, REACH encourages
the use of quantitative structure–activity relationship (QSAR)
models to predict the biodegradability of compounds.[6]In particular, several QSAR classification models
have recently
been proposed for predicting the “ready biodegradable”
(RB) or “not ready biodegradable” (NRB) class of materials[7] as artificial intelligence (AI) methods have
become more accessible. Previous studies have applied machine learning
algorithms, such as partial least squares discriminant analysis (PLSDA),
multiple linear regression (MLR), logistic regression (LR), naive
Bayes (NB), k-nearest neighbors (kNN), and support vector machines
(SVM), as QSAR models to molecular fingerprints and descriptors to
develop a biodegradability classification model.[1,3,7−9] The datasets, Sn, and
Sp results of previous studies using QSAR models for biodegradability
prediction are summarized in Table S1.
Although various models and datasets are implemented, it is not clear
whether which regressors are superior to the others. In addition,
software has been developed to calculate molecular descriptors, such
as PaDEL-Descriptor[10] and DRAGON,[11] which were used as features to develop QSAR
models.[3,12] The freely available PaDEL-Descriptor can
calculate 1875 molecular descriptors, such as atom-type electrotopological
state descriptors, McGowan volume, molecular linear free energy relation
descriptors, ring counts, and 10 types of fingerprints.[10] DRAGON is another widely used tool that can
calculate 5270 molecular descriptors, but it is difficult to use because
of licensing issues.[13]While it has
been proven that integrating data and ensemble analysis
can increase the reliability of QSAR models, they commonly face problems
with uncertainty for various reasons.[7] For
example, QSAR models make false correlations owing to errors in the
experimental process or do not fully reflect the characteristics of
the data because of the small number of training databases.[14] In addition, they intrinsically require the
generation of appropriate features to train the models. Hence, choosing
the appropriate features is often a complicated issue. For example,
several structural features of molecules (such as halogen, chain branching,
and nitro groups) have been shown to increase biodegradation time,
while others (such as esters, amides, and hydroxyl groups) have been
found to decrease biodegradation time. However, these structural features
cannot be generalized to represent both RB and NRB molecules.[15]To overcome the above-mentioned limitations,
we intend to predict
biodegradability using graph neural networks (GNNs) and compare their
performance with that of QSAR models. In particular, graph convolutional
networks (GCNs)[16] are implemented, which
produce outstanding performance in various fields dealing with molecules,
such as molecular property, activity, interaction, and synthesis prediction.[17] This indicates that such a method can potentially
replace the QSAR model.[18] In chemistry,
the atoms and bonds that make up a molecule can be naturally converted
into graphs by mapping them into sets of nodes and edges,[19] which can be used as input features. This study
compares the procedures and results of QSAR models and the GCN model
for biodegradability classification, demonstrates the advantages and
disadvantages, and discusses the great potential of the GCN model.
The overall procedure of the two models is shown in Figure . The GCN model can be directly
applied by converting the simplified molecular input line entry system
(SMILES) into graphs, whereas the QSAR model is applied by calculating
and selecting molecular descriptors and fingerprints from SMILES.
This schematic indicates that the GCN implementation is less complicated
and requires less information than the QSAR model.
Figure 1
Flowchart conducted in
this study. The dataset (green) consists
of SMILES and RB/NRB class data. The QSAR model (blue) consists of
applying four machine learning algorithms by generating descriptors
and fingerprints that can be calculated from SMILES. The GCN model
(yellow) was composed of converting SMILES into molecular graphs and
using the architecture of Figure c.
Flowchart conducted in
this study. The dataset (green) consists
of SMILES and RB/NRB class data. The QSAR model (blue) consists of
applying four machine learning algorithms by generating descriptors
and fingerprints that can be calculated from SMILES. The GCN model
(yellow) was composed of converting SMILES into molecular graphs and
using the architecture of Figure c.
Figure 2
(a) Visualizing the number of RB/NRB in the dataset. (b)
Features
of atoms constituting the node feature matrix of the molecular graph.
(c) GCN model architecture.
Methods
Biodegradability
Database
This study used “All-Public
set”, an organized and aggregated dataset from five different
sources:[9] the ECHA database,[20] the NITE database,[21] and the training sets of the existing tools VEGA,[22] EPI Suite,[23] and OPERA.[24] Subsequently, CADD Group Chem and User Services[25] and PubChem[26] were
used to verify the accuracy of SMILES. Standardization was performed
using workflows implemented in Konstanz Information Miner (KNIME).[27] Additionally, deduplication was based on standardized
SMILES matching. The datasets represent biodegradable classes with
RB and NRB, as shown in Figure a, which are 1097 and 1733,
respectively.(a) Visualizing the number of RB/NRB in the dataset. (b)
Features
of atoms constituting the node feature matrix of the molecular graph.
(c) GCN model architecture.
Molecular Descriptors
Mordred,[13] a molecular descriptor calculation software that can calculate more
than 1800 two- and three-dimensional descriptors, was used for the
QSAR model. It is an open-source software that is easier to install
and use than other libraries (e.g., Cinfony[28] and ChemoPy[29]) and can be employed flexibly
with a rapid calculation speed. However, 3D structural descriptors,
such as CPSA and MoRSE, can cause complex and non-reproducible optimizations.[3] While 3D descriptors represent valuable chemical
information about molecules, they may be difficult to apply to new
molecules because geometric optimization is necessary. In addition,
they vary between 3D conformers, which can affect the values of the
3D descriptors. Therefore, this study used Mordred to calculate a
total of 1613 two-dimensional values, such as the adjacency matrix,
distance matrix, S log P, and weight.
In addition to descriptors, molecular fingerprints are often used
for biodegradability classification using QSAR models;[1,7,8] in this study, the most widely
used Morgan fingerprint[30] (1024-bit), MACCS
fingerprint[31] (167-bit), and, additionally,
RDKit topological fingerprint (2048-bit)[32] were tested.In contrast, the molecular graph for the GCN
model was transformed in response to the atoms (nodes) and the bonds
between atoms (unweighted edges). The graph is represented by G = (X, A), where X is a 75-dimensional node feature matrix (N × 75, N is the number of nodes) of the chemical
features of atoms, as shown in Figure b, and A represents an adjacency matrix
(N × N), in which A = 1 when the ith and jth atoms are bonded and A = 0 otherwise.
The molecules were represented by SMILES and converted to graphs using
RDKit[32] and PyTorch Geometric (PyG).[33]
Machine Learning Details
Four classification
algorithms
were applied for QSAR modeling: kNN, SVM, random forest (RF), and
gradient boosting (GB). kNN, SVM, and RF have already been implemented
for biodegradability predictions,[3,8,9] and GB, one of the representative boosting models,
was additionally tested. These models were implemented and tuned using
scikit-learn,[34] and their details can be
found in the Supporting Information. In
particular, for the case of SVM and RF, the weight balancing technique
was also applied to consider imbalanced data. In addition, ensemble
analysis was applied to combine the prediction results of the four
different machine learning algorithms. This method can further enhance
the predictive reliability of the model.[35] Because individual models contain a variety of noises, averaging
the prediction results of the implemented models can reduce the overall
noise.[36] Among the ensemble analysis methods,
a soft voting ensemble (SVE) classifier using scikit-learn was applied
in this study. This method selects the highest class by summing the
probabilities predicted by each model.
Graph Convolutional Network
(GCN)
As shown in Figure c, the GCN model
for biodegradability prediction consists of two regions: (i) graph
convolution layers (blue) and (ii) prediction layers (green). In the
graph convolution layers (which consist of four layers), the features
of neighboring atoms and the central atom are calculated by GCNConv,[38] as follows:where à = A + I, I is the identity
matrix of the adjacency matrix, D̃ is the diagonal
node degree matrix, W( is the weight matrix of layer l, H(0) = X, where X is
the feature matrix of the nodes, and H( is the output of layer l. The
prediction layers consist of three fully connected layers and eventually
return prediction results via the log softmax function.
Evaluation
Metrics
To compare the results of the QSAR
and GCN models, the dataset was randomly divided into training and
test sets at a ratio of 8:2, and the proportion of each class was
uniformly distributed (stratified sampling) for each set. The classification
model was evaluated based on the metrics of balanced accuracy (BA),
specificity (Sp), sensitivity (Sn), and error rate (ER) for the prediction
of RB and NRB of molecules. The evaluation metrics are calculated
as follows:where TN and TP are true negative and true
positive values, respectively, while FN and FP are false negative
and false positive values, respectively. Additionally, Sn and Sp are
inversely proportional; therefore, a higher Sn corresponds to a lower
Sp and vice versa.[39]
Results and Discussion
To develop the most accurate surrogate model, it is imperative
to conduct several tests to find appropriate combinations of parameters
in various cases using the four QSAR models and one GCN model. In
this respect, three approaches were used: (i) BA values from 10-fold
cross-validation (CV) were obtained for direct comparison of the machine
learning models, (ii) the average and range of four metrics from 100
different configurations were calculated by randomly dividing the
dataset into training and test sets, and (iii) the test set ratio
was varied from 20 to 80% with 100 different random states, as conducted
in case 2, to further verify the prediction stability. By comparing
these metrics, essential conditions for improving the model performance,
such as several descriptors and algorithms, were selected, and the
final model was determined.
Molecular Descriptors and Fingerprints
The performance
of the QSAR models was first tested when only Morgan, MACCS, and topological
fingerprints were used, as shown in Table S2. Especially, Morgan and MACCS fingerprint methods are widely used
in various molecule-related fields, such as virtual screening[40,41] and target prediction benchmarks.[42,43] Although the
results indicate that there is no significant difference, the MACCS
fingerprint among the three types of molecular fingerprints is slightly
better in the four QSAR models; hence, it was adopted. In addition,
some prior studies[3,8,9] that
classified biodegradability using QSAR models implemented genetic
algorithms (GAs) for feature selection of descriptors obtained from
these fingerprints. Similarly, in this study, a GA was used to find
the optimal subset of features by reducing unnecessary or insignificant
features, thereby potentially increasing the training speed and prediction
accuracy.However, using sklearn-genetic[44] to select the most relevant of the 1613 two-dimensional
descriptors did not significantly improve the model performance, as
shown in Table S3. (The numbers of descriptors
in each model were reduced as follows: kNN, SVM, RF, and GB were implemented
with 642, 50, 832, and 470 descriptors, respectively.) This process
was conducted by obtaining BA scores from 10-fold CV for all models
and descriptors (MACCS and Mordred) with and without GA-based feature
selection (Tables S2 and S3). As a result,
although using any of the three types of molecular fingerprints did
not make a significant performance improvement, the case of using
only MACCS fingerprints showed slightly better performance in common
(BA > 0.82), so only MACCS fingerprints were adopted. Also, as
a result
of comparing the performance of applying GA to the descriptors calculated
by Mordred, kNN and RF use the selected descriptors, and SVM and GB
use all the calculated descriptors. The result of GA also did not
make a significant performance change, but it is important in that
it improved kNN and RF by evaluating and analyzing a small number
of key descriptors. In addition, as shown in Figure S1, the Pearson correlation coefficient between all the descriptors
and the class (RB/NRB) was not exceptionally high or meaningful. For
53 relatively more correlated descriptors (correlation > 0.35)
and
MACCS fingerprints, the BA value in 10-fold CV measured using SVM
was 0.83, which was lower than that using the full descriptors (0.84; Table S3). Therefore, along with the MACCS fingerprint,
kNN and RF for descriptors were selected by GA, and SVM and GB for
all descriptors were adopted as final QSAR models in this study. An
unsupervised learning approach using dimensionality reduction was
attempted for further analysis of the descriptors and fingerprints,
as shown in Figure S2. Using principal
component analysis (PCA), t-distributed stochastic neighbor embedding
(t-SNE),[45] and uniform manifold approximation
and projection (UMAP)[46] algorithms, 1780
dimensions (1615-dimensional descriptors and 167-bit fingerprints)
were reduced to two dimensions and visualized. In addition, the k-means
clustering algorithm was applied to obtain BA values of 0.56, 0.60,
and 0.58 for PCA, t-SNE, and UMAP, respectively. In particular, t-SNE
exhibited a relatively visually distinguishable positional relationship
between the two classes. Although these results are not sufficiently
accurate to fully replace the supervised learning approach, the semi-supervised
approach could still be applied to label molecules with unknown target
properties.
Performance of the QSAR Models
The
performances of
the QSAR and GCN methods were compared using the metrics proposed
above. As shown in Figure and Table , for the three QSAR models except kNN, SVM, RF, and GB exhibit the
same trend in the results: Sp is larger than Sn, and the error range
of Sn (fluctuation magnitude) is relatively larger. This means that
RB is often not correctly predicted, and its prediction capability
could be unstable compared to NRB classification. A previous study
also found that Sp, which is influenced by this prediction imbalance,
is larger than Sn.[3] In the case of kNN,
on the contrary, Sn was higher than Sp, but the overall performance
was not good, especially when the test size was 0.8. The primary cause
for such behavior could be due to the imbalanced dataset, and this
is why this study considered the weight balancing technique for SVM
and RF (the number of RB entries is relatively smaller than that of
NRB, as discussed in Figure a).
Figure 3
For the entire dataset, the composition of the training set and
test set divided by 8 to 2 was randomly different, indicating a result
of 100 times. (a) Results of the QSAR models and (b) result of the
GCN model and a confusion matrix example. The confusion matrices of
the remaining models can be seen in Figure S4.
Table 1
Summary of Results
for Test Sizes
of 0.2 and 0.8 for All Modelsa
test
size of 0.2
test
size of 0.8
model
BA
Sn
Sp
ER
BA
Sn
Sp
ER
kNN
0.83 (±0.04)
0.85 (±0.06)
0.81 (±0.05)
0.18 (±0.04)
0.79
(±0.02)
0.83 (±0.05)
0.75 (±0.04)
0.22 (±0.02)
SVM
0.84 (±0.04)
0.82 (±0.07)
0.86
(±0.05)
0.16 (±0.04)
0.81 (±0.02)
0.78 (±0.05)
0.84 (±0.04)
0.18 (±0.02)
RF
0.83
(±0.04)
0.81 (±0.07)
0.86 (±0.05)
0.16 (±0.04)
0.81 (±0.02)
0.76 (±0.06)
0.85 (±0.04)
0.18
(±0.02)
GB
0.84 (±0.04)
0.78 (±0.07)
0.90 (±0.04)
0.15 (±0.03)
0.81 (±0.02)
0.74
(±0.05)
0.88 (±0.03)
0.17 (±0.02)
GCN
0.84 (±0.05)
0.82 (±0.06)
0.86 (±0.04)
0.16
(±0.04)
0.81 (±0.02)
0.80 (±0.04)
0.81 (±0.04)
0.19 (±0.02)
Details of each result are shown
in Figures and 4.
For the entire dataset, the composition of the training set and
test set divided by 8 to 2 was randomly different, indicating a result
of 100 times. (a) Results of the QSAR models and (b) result of the
GCN model and a confusion matrix example. The confusion matrices of
the remaining models can be seen in Figure S4.Details of each result are shown
in Figures and 4.
Figure 4
Results of measuring
the configuration of the training set and
test set for the entire dataset by dividing it from 8 to 2 to 2 to
8. The results showed the average of randomly performing each configuration
100 times differently.
As shown
in Table , both SVM
and GB show the highest performance among the QSAR models
with a BA value of 0.84 (±0.04). Among the two models, ER of
GB was lower at 0.15 (±0.03), but when Sn and Sp were compared,
SVM values were closer to 0.82 (±0.07) and 0.86 (±0.05),
respectively. These results were the same even when the test size
was 0.8. To further verify the performance of the models, the variation
in the prediction accuracy with the ratio of the test set to the training
set was obtained, as shown in Figure . The results show that, as the proportion of the test
set approaches 80%, the performance of the surrogate models deteriorates
(e.g., kNN was the worst with a BA of 0.79). Furthermore, Sp was generally
larger than Sn, and their difference increased as the size of the
test set increased. This indicates that the intrinsic data imbalance
could cause considerably poorer prediction accuracy when the developed
model was employed for unexplored molecules. To further verify potential
improvement in the prediction accuracy, we adopted the ensemble analysis
model SVE. Ensemble analysis has been used to overcome the limitations
of individual classifiers.[47] This method
aims to increase the accuracy by combining the prediction results
of various models to generate the final prediction results. Compared
with the performance of individual classifiers, ensemble classifiers
have demonstrated better stability and robustness.[48] However, employing the SVE did not improve the model performance
compared to that of SVM, even though all four models were used, as
shown in Figure S3. To provide further
details, the confusion matrices for all the considered models are
shown in Figure S4.Results of measuring
the configuration of the training set and
test set for the entire dataset by dividing it from 8 to 2 to 2 to
8. The results showed the average of randomly performing each configuration
100 times differently.To summarize, as with
many machine learning algorithms, the development
of the QSAR model requires substantial consideration when choosing
proper algorithms and feature selection methods for various descriptors
and fingerprints. Hence, different types of accessible fingerprints
and descriptors have been suggested. However, this increases the complexity
of choosing a suitable model because it requires additional tests.
Moreover, in addition to the results of the molecular descriptor generator,
the physicochemical properties that directly represent the molecules
can improve the performance of the model, but their actual implementation
requires considerable resources owing to the limited availability
and validity of the experimental data.[49]
GCN Model Details
Similar to the QSAR model, the optimal
parameters for the GCN model must be found to obtain the best performance.
In particular, various graph convolution algorithms are available,
and the prediction accuracy depends on the model chosen. GCNConv,
one of the most representative algorithms, was employed in this study,
and its performance was compared with those of SAGEConv[37] and ARMAConv,[51] another
representative algorithm. GCNConv is widely used and has been used
to predict various molecular properties, such as drug–target
affinity,[52] free energy, solubility, and
metabolic stability.[53] As can be seen in Figure S5, it is difficult to judge that the
performance of GCNConv is particularly superior to those of the other
two algorithms, but in this study, the architecture consisting of
four graph convolution layers of GCNConv and three prediction layers
was adopted as it was judged appropriate in terms of computational
cost and performance.[16] When the architecture
was deeper with eight and seven layers, the error ranges in Sp and
Sn are slightly wider (±0.07 and ±0.05, respectively; test
set size was 20%). However, when four graph convolution layers and
three prediction layers were used, Sp and Sn values were narrower
(±0.06 and ±0.04, respectively). This performance suggests
that the implemented GCN model is similarly stable enough for different
algorithm and architecture choices to predict the biodegradability
of molecules.
Performance of the GCN Model
The
GCN model did not
show dramatically improved performance compared to the QSAR models.
The BA value was the same as those of SVM and GB, and the difference
in the ER value was not significant. However, an essential difference
in the GCN model is the distance between Sp and Sn and convenience
of the development process. In particular, there was only a difference
in degree in the QSAR models: Sp was higher than Sn in all models,
which caused instability in the prediction performance. However, especially
when the test size is 0.8, the GCN model had similar BA, Sp, and Sn
values, indicating that the prediction was more stable for both classes
than that of the QSAR models (Table ). Figure and Table confirm that this meaningful result of the GCN model was maintained
even when the ratio of the test set was increased up to 80%, in contrast
to SVM. This implies that the performance of the GCN model is similar
to that of the QSAR models for the overall prediction results, but
when the test size is 0.8, its prediction capability for each class
is more stable and better than that of the four QSAR algorithms. The
confusion matrices are shown in Figure S4 for further comparison.As mentioned earlier, the GCN model
can be developed by defining molecular graphs, applying various convolution
algorithms, restructuring the model architecture, and adjusting the
hyperparameters. However, unlike the QSAR models, the GCN model predicts
the target properties using only the features of the atoms in molecular
graphs; therefore, it does not require descriptors and fingerprints
that are obtained through calculations. Moreover, this means that
complicated feature selection processes are not required. Consequently,
while developing the GCN model requires considerable effort and various
tests to understand and optimize the architecture (similar to the
QSAR models), because it only uses molecular graphs, it is easy to
expand the dataset. Furthermore, it is simpler and more convenient
because it does not perform feature importance analysis; as a result,
it exhibits stable performance.
Conclusions
Because
the biodegradability of materials is a molecular property
directly related to environmental issues, efforts have been made to
predict it. Numerous studies have attempted to solve this problem
using QSAR machine learning algorithms, which have shown sufficient
performance. However, models with many descriptors are less likely
to generalize and have complex interpretations. Furthermore, the conditions
that must be considered for the development of QSAR models and their
performance have highlighted issues that require further improvement.As a solution, this study presented a GCN model that is widely
applied to molecules with high performance. For comparison, four QSAR
models, an ensemble model, and the GCN model were constructed for
datasets organized in a previous study. In addition, molecular descriptors
and fingerprints were calculated and selected to develop the QSAR
models, and convolutional algorithms and model architectures were
selected to develop the GCN model. The results confirmed that the
GCN model does not require descriptors or fingerprints and that, unlike
the QSAR models, the Sp and Sn of the results of the GCN model are
similar to each other, demonstrating stable performance for binary
classification. Furthermore, these advantages of the GCN model suggest
its potential to be an easy and robust solution not only for biodegradability
but also for various QSAR prediction challenges in molecules.