Abdul Karim1, Vahid Riahi1, Avinash Mishra2, M A Hakim Newton3, Abdollah Dehzangi4,5, Thomas Balle6,7, Abdul Sattar3. 1. School of Information Communication Technology, Griffith University, Nathan, Brisbane 4111, Australia. 2. Department of Chemical Engineering, Indian Institute of Technology, Hauz Khas 110016, New Delhi, India. 3. Institute of Integrated and Intelligent Systems, Griffith University, Nathan, Brisbane 4111, Australia. 4. Department of Computer Science, Rutgers University Camden, Camden 08102, New Jersey, United States. 5. Center for Computational and Integrative Biology, Rutgers University Camden, Camden 08102, New Jersey, United States. 6. Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown 2006, New South Wales, Australia. 7. Brain and Mind Centre, The University of Sydney, Camperdown 2006, New South Wales, Australia.
Abstract
Toxicity prediction using quantitative structure-activity relationship has achieved significant progress in recent years. However, most existing machine learning methods in toxicity prediction utilize only one type of feature representation and one type of neural network, which essentially restricts their performance. Moreover, methods that use more than one type of feature representation struggle with the aggregation of information captured within the features since they use predetermined aggregation formulas. In this paper, we propose a deep learning framework for quantitative toxicity prediction using five individual base deep learning models and their own base feature representations. We then propose to adopt a meta ensemble approach using another separate deep learning model to perform aggregation of the outputs of the individual base deep learning models. We train our deep learning models in a weighted multitask fashion combining four quantitative toxicity data sets of LD50, IGC50, LC50, and LC50-DM and minimizing the root-mean-square errors. Compared to the current state-of-the-art toxicity prediction method TopTox on LD50, IGC50, and LC50-DM, that is, three out of four data sets, our method, respectively, obtains 5.46, 16.67, and 6.34% better root-mean-square errors, 6.41, 11.80, and 12.16% better mean absolute errors, and 5.21, 7.36, and 2.54% better coefficients of determination. We named our method QuantitativeTox, and our implementation is available from the GitHub repository https://github.com/Abdulk084/QuantitativeTox.
Toxicity prediction using quantitative structure-activity relationship has achieved significant progress in recent years. However, most existing machine learning methods in toxicity prediction utilize only one type of feature representation and one type of neural network, which essentially restricts their performance. Moreover, methods that use more than one type of feature representation struggle with the aggregation of information captured within the features since they use predetermined aggregation formulas. In this paper, we propose a deep learning framework for quantitative toxicity prediction using five individual base deep learning models and their own base feature representations. We then propose to adopt a meta ensemble approach using another separate deep learning model to perform aggregation of the outputs of the individual base deep learning models. We train our deep learning models in a weighted multitask fashion combining four quantitative toxicity data sets of LD50, IGC50, LC50, and LC50-DM and minimizing the root-mean-square errors. Compared to the current state-of-the-art toxicity prediction method TopTox on LD50, IGC50, and LC50-DM, that is, three out of four data sets, our method, respectively, obtains 5.46, 16.67, and 6.34% better root-mean-square errors, 6.41, 11.80, and 12.16% better mean absolute errors, and 5.21, 7.36, and 2.54% better coefficients of determination. We named our method QuantitativeTox, and our implementation is available from the GitHub repository https://github.com/Abdulk084/QuantitativeTox.
Prediction of toxicity
levels of chemical compounds is a great
challenge. Every year a large number of chemical compounds are produced
worldwide, and many of them are suspected to be toxic and are eventually
proved so. Toxicity is the extent to which a chemical compound (or
mixture of compounds) can cause death, injury, or adverse effects
on a living organism and is defined with reference to the quantity
of substance administered or absorbed, the way in which the substance
is administered (inhalation, ingestion, topical application, or injection)
and distributed in time (single or repeated doses), the type and severity
of injury, the time needed to produce the injury, the nature of the
organism(s) affected, and other relevant conditions.[1] The toxic concentration of a compound is ascertained by
experiments measuring end points. Toxicity of a compound can vary
with individuals and their age, gender, and body weight. Thus, different
toxicity indicators are devised to measure the toxicity over a population.
Toxicity estimation, similar to other attributes of chemical compounds,
is done using sophisticated experimental techniques on in
vivo or in vitro models. However, these
techniques are very time consuming and cost intensive. They also raise
ethical concerns because of the involvement of animals or tissue harvested
from animals. To address these issues, in silico methods
(computer-aided methods) have attracted much attention due to their
lower cost and better time efficiency. There exist many in
silico methods, but the quantitative structure activity–property
relationship (QSAR/QSPR) method is one of the most successful ones.
The main rationale behind the QSAR method is that chemical molecules
that are similar in structure should have similar activities and properties.
Therefore, studying the relationship between chemical structures and
biological activities/properties of existing chemicals with known
activities enables the construction of models that can predict activities
of novel chemicals using multivariate statistical methods or machine
learning algorithms.[2]QSAR modeling
using deep learning techniques has become very popular.[3] Many of these methods use two-dimensional (2D)
features calculated from the one-dimensional (1D) representation of
the molecules called SMILES strings. SMILES is a language used in
describing the chemical structure of a molecule as a string of characters.[4] There is a special grammar for SMILES to represent
atoms, types, and chemical bonds between them. SMILES strings are
used for calculating various types of numerical features (e.g., physicochemical
descriptors) and molecular graphs by using different featurization
methods.[5,6] These numerical features can then be used
by traditional machine learning approaches such as K-nearest neighbors
(KNN), support vector machines (SVM), random forest (RF), and fully
connected neural networks (FCNN) to predict activities or properties
of a chemical compound.[7] Besides numerical
features, SMILES strings can also be used to generate molecular graphs
or images, which then can be used in various types of convolutional
neural networks (CNN) to predict molecular activities.[8] Using CNN for molecular graphs or images needs relatively
less domain expertise. It should be noted that SMILES strings can
also be transformed into a vector representation or into their respective
fingerprints. Fingerprints are bit strings composed of 0s and 1s and
can be used in recurrent neural networks (RNN) for molecular activity/property
prediction.[9] Recently, in the area of toxicity
prediction, a specialized type of features called element-specific
topological descriptors (ESTDs) have been used in deep neural networks
and in consensus models by the TopTox[10] to predict toxicity levels of chemical compounds. A recent software
named AdmetSAR uses molecular fingerprints (MFP) to predict toxicity
using the RF, SVM, and KNN models.[11] Another
method, named Hybrid2D, uses joint optimization of shallow neural
networks and decision trees on 2D features to predict toxicity measurement
levels.[12] The performance of all these
quantitative prediction methods is restricted by the specific type
of features or models used in prediction. Other than that, various
ensemble learning methods have also been used in bioinformatics and
QSAR prediction such as DeepHIT,[13] FS-MLKNN,[14] and piRNA Predictor.[15] Ensemble learning is a sophisticated technique of combining features,
which recently is attracting more and more interest in the bioinformatics
field.[14] DeepHIT utilizes a reasonably
diverse feature set, but it still suffers from the lack of an effective
way for combining the outputs of individual models to obtain a robust
performance over a range of metrics. Moreover, DeepHit is optimized
for cardiotoxicity classification to enhance the sensitivity of the
model only. FS-MLKNN and piRNA Predictor use a weighted scoring ensemble
strategy to predict the side-effects of various drugs and transposon-derived
piRNAs, respectively.In this work, for quantitative toxicity
prediction, we hypothesized
that an effective aggregation of various chemical information captured
within various feature representations extracted from SMILES strings
can improve the prediction accuracy. For this purpose, we propose
a three-stage deep learning framework: A featurization stage first
generates a number of base features. A base learning stage then trains
a number of deep learning models, one for each base feature. A meta-learning
stage uses the outputs of the base learning stage as input meta features
and trains a separate deep learning model for meta ensembling and
producing the final output. The five types of base features generated
in the featurization stage are 2D and three-dimensional (3D) descriptors,
molecular graph features (MGFs), extended-connectivity fingerprints
(ECFPs), SMILES vocabulary-based embedded vectors, and fingerprint-based
embedded vectors. The base learning stage comprises five deep learning
models such as 2 × deep neural networks, 1 × graph convolutional
neural network (GCNN), and 2 × 1D CNN. Each of these deep learning
models essentially is for one of the base features. The meta-learning
stage comprises a fully connected deep neural network. We trained
all of our deep learning models in a weighted multitask fashion combining
four quantitative toxicity data sets of LD50, LC50, IGC50, and LC50-DM and minimizing the root
mean square error. Compared to the current state-of-the-art toxicity
prediction method TopTox, on LD50, IGC50, and
LC50-DM, that is three out of four data sets, our method,
respectively, obtained 5.46, 16.67, and 6.34% better root means square
errors, 6.41, 11.80, and 12.16% better mean absolute errors, and 5.21,
7.36, and 2.54% better coefficients of determination. We named our
method QuantitativeTox and our implementation is available from the
GitHub repository https://github.com/Abdulk084/QuantitativeTox.
Materials and Methods
We describe the four quantitative
data sets, the evaluation criteria,
and the weighted loss function used in this work. As shown in Figure , our deep learning framework has three stages: featurization, base
learning, and meta-learning. The featurization stage is to generate
base features which are used in training base learning models. The
output of the base learning models is then used as meta features for
the meta-learning model to produce the final predictions. We describe
the three stages in more detail.
Figure 2
End-to-end
flow diagram of all stages of the proposed framework.
(a) FCNN for 995 2D + 3D descriptors as
base features. (b) CNN
for node vectors of size 51 × 65 and adjacency vectors of size
51 × 51 as base features. (c) FCNN for 1024 ECFP and 881 pubChem
fingerprints as base features. (d) 1D CNN for each of the SMILES embedded
vectors and fingerprint-embedded vectors as base features. (e) Meta
ensemble FCNN for meta features.End-to-end
flow diagram of all stages of the proposed framework.
Featurization Stage
The featurization stage of our
framework consists of various types of featurizers. Each featurizer
takes SMILES strings as input and produces fixed-length base features
as output. Figure shows the five featurizers and their output base features.
2D and 3D
Descriptors (DESC)
A total of 995 high-level
features such as 2D and 3D physicochemical descriptors are computed
using Mordred.[16] The feature names are
described in Table S2 of DeepHIT in the Supporting Information.[13] These features are
numerical in nature and describe the physical and chemical properties
of molecules.[17] 2D descriptors represent
information related to size, shape, distribution of electrons, octanol–water
distribution coefficients (log P) measuring lipophilicity,
nAromAtom denoting the number of aromatic atoms, nHeavyAtom denoting
the number of non-hydrogen atoms, and nBondsT denoting the number
of triple bonds. 3D descriptors relate to the 3D conformation of the
molecules and include the moment of inertia along the Y axis (MOMIY).[17] The value of each descriptor
is normalized between 0 and 1.
Molecular Graph Features
Topological information of
molecules can be intuitively and concisely expressed via molecular
graph features. In this featurizer,[13,18] molecular
graph features such as node vectors and adjacency matrix are computed.
Node vectors represent atoms in the SMILE strings. The adjacency matrix
represents the bonds between atoms. In this study, we extract [51
× 65] node vectors and a [51 × 51] adjacency matrix. Here,
51 is the maximum number of atoms and 65 is the length of the one
hot-encoded feature vector computed from atom descriptors. The details
of these are in Table S3 of DeepHIT in
the Supporting Information.[13]
Molecular Fingerprint
The third
featurizer deals with
fingerprints, where structural features are represented either by
bits in a bit string or by counts in a count vector.[19,20] 1024 extended-connectivity fingerprints with a maximum diameter
parameter of 2 (ECFP2) fingerprints and 881 pubChem fingerprints are
computed using the Python package PyBioMed.[13,21] ECFPs are also referred to as circular fingerprints and are specifically
designed for structure–activity relationship modeling,[22] whereas pubChem fingerprints are mainly designed
for similarity neighboring and similarity searching.[23]
SMILES Strings Embedded Vectors
We compute low-level
features SMILES strings embedded vectors (SeV).[9,24] These
features do not directly describe any biological attribute of the
molecules but have been proven to have a reasonable predictive power
in various QSAR tasks. In the SMILES vectorizer, we create a vocabulary
based on the valid SMILES tokens (procedure described in Supporting Information, S2). A total of 64 unique
tokens are determined based on the training data. Each SMILES string
is converted into a one-hot encoded vector based on the SMILES vocabulary.
Fingerprint-Embedded Vectors
We also compute fingerprint-based
embedded vectors (FPeV).[25] In the fingerprint
vectorizer, SMILES strings are converted into 1024-bit Morgan (or
circular) fingerprints with a radius of 2 via RDKIT.[26] As per an existing technique,[25] we extract fingerprint indices, which are marked 1 in the fingerprints
generated. Thus, we obtain a vector of length 93 where the vector
consists of integers representing the presence of specific substructures
in a molecule. The procedure for fingerprint-embedded vectors is described
as FP2VEC.[25]
Base Learning Stage
The base learning stage consists
of five base deep neural network models, which are trained on respective
base features from the featurization stage. All of the base models
are trained at a learning rate of 10 × 10–4 with an Adam optimizer and 100 epochs with a batch
size of 32. The selection of parameters, hyper-parameters, and network
architecture of base models is inspired by previous published research
in this area.[9,12,13,24,25,27,28] Each of these base
models produces four regression values for four tasks as output; only
the output corresponding to a given task is counted for a given input.
The base models are shown in Figure a–d and are described below. The Keras deep
learning framework and Spektral package are used in developing the
base models.[29,30]
Figure 1
(a) FCNN for 995 2D + 3D descriptors as
base features. (b) CNN
for node vectors of size 51 × 65 and adjacency vectors of size
51 × 51 as base features. (c) FCNN for 1024 ECFP and 881 pubChem
fingerprints as base features. (d) 1D CNN for each of the SMILES embedded
vectors and fingerprint-embedded vectors as base features. (e) Meta
ensemble FCNN for meta features.
Fully Connected Neural
Network for DESC
As shown in Figure a, a fully connected
deep neural network with four hidden layers was trained and validated
on 995 2D + 3D physicochemical descriptors. The input layer consists
of 995 nodes for the 995 physicochemical descriptors. The output layer
has four units, one for each of the four tasks. All layers in the
fully connected neural network for DESC (FCNND) are densely connected
and receive input from all units in the previous layer. The number
of units in each hidden layer decreases gradually, and a ReLU activation[31,32] is applied at the end of each layer except the output layer. Various
regularization parameters such as Kernel regularizer that applies
penalties to the Kernel (main units in the layer) and bias regularizer
that applies penalties to the bias units to reduce overfitting during
optimization[32,33] are used. We also apply a drop-out
rate of 0.5 to the middle layers.[34]
Graph
Convolutional Neural Network for MGF
As shown
in Figure b, a GCNN
was trained using the molecular graph features. GCNN consists of two
graph convolution layers, one global attention pool layer and a dense
layer before the output.[35] Each of the
graph convolutional layers is initiated with 64 channels with a Kernel
regularization value of 0.01 and ReLU activation. The number of channels
in the global attention pool layer is made equal to 1024, the number
of units in the following dense layer.
Fully Connected Neural
Network for MFP
As shown in Figure c, a FCNN is used
with fingerprints as the base feature. Unlike FCNND, a fully connected
neural network for MFP (FCNNF) uses a much smaller number of units
in each layer. Except for the number of units, other parameters are
kept the same as in FCNND. The number of input nodes in the input
layer is kept at 1905 to match the sum of 1024 ECFP fingerprints and
881 pubChem fingerprints.
Convolution 1D Neural Network for SeV and
for FPeV
As shown in Figure d, a variant of a convolution 1D neural network (C1D)
is used for
each of the SMILES-embedded vectors and fingerprint-embedded vectors
as base features. The only difference between the two C1Ds is in the
number of input-layer nodes: 97 for SMILES-embedded vectors and 93
for fingerprint-embedded vectors. Input vectors are converted to a
trainable embedded matrix of size [97 or 93 × 200], which was
then fed into a series of three 1D convolution layers. Each of these
1D convolution layers used ReLU activation, 192 filters with a Kernel
size of 10, 5, and 3, respectively. Two densely connected layers are
also used before the output layer.
Meta-learning Stage
As mentioned before, each base
model produces four outputs for four data sets. The outputs of the
base models are used as meta features for the meta-learning model.
As shown as FCNNM in Figure e, the meta-learning model is an FCNN with an input layer,
an output layer, and two hidden layers. It is trained at a learning
rate of 10 × 10–3 with an
Adam optimizer and 300 epochs with a batch size of 32. The meta-learning
model acts as an ensembling method for the whole framework. Our hypothesis
is that for quantitative toxicity prediction, meta ensembling will
be better able to aggregate the output of individual base models than
the typical average ensembling approaches.
Data Sets Used
We use end points for four quantitative
toxicity data sets (also called tasks). These data sets are LD50, IGC50, LC50, and LC50-DM.[10] These endpoint measures have been used in toxicology
for estimating the toxicity behavior of a given chemical compound
on a given population of a given organism. These measures depend on
the concentration of the compound as well as the duration of exposure
of a given organism to the compound. LC50 and LD50 are the compound concentrations that kill half the members of the
tested animal population. LC50 records the toxicity of
a given compound on fathead minnow, a species of temperate freshwater
fish after 96 h exposure. LC50-DM records the concentration
of a compound in water in milligrams per liter causing 50% population
of Daphnia maga to die after 48 h. LD50 data set has the
lethal dose data for killing 50% rat population when a given compound
is administered orally, given that oral administration could cause
less toxicity than an intravenous route. IGC50 data set
shows the concentration of a chemical compound to arrest the growth
of Tetrahymena pyriformis when exposed
for 40 h. The units of LC50, LC50-DM, and IGC50 end points are – log 10(T mol/L),
where T represents corresponding end points. For
the LD50 set, the units are – log 10(LD50 mol/kg).Original data is available at http://cfpub.epa.gov/ecotox/, http://cfpub.epa.gov/ecotox/, http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp, and http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp. We obtained preprocessed train and test sets (as pairs of SMILES
strings and toxicity measures) from TopTox.[10] As shown in Table , these data sets have different sizes ranging from hundreds to thousands
of compounds. In the LD50 data set, some molecules were
removed as part of standardization using RDKIT (http://www.rdkit.org/) and MolVS
(https://molvs.readthedocs.io/en/latest/). The train set from each of the four tasks is randomly split into
four types of subsets: 70% for the base train set, 10% for the base
validation set, 10% for the meta train set, and 10% for the meta validation
set. Next, for each type of subset, the corresponding subsets for
the four tasks are concatenated to obtain a combined set of that type.
The test sets from the four tasks are also concatenated to obtain
a combined test set. These combined train and test sets are available
from the GitHub repository https://github.com/Abdulk084/QuantitativeTox/blob/master/training_multitask.tar.xz. The data split procedure is in Supporting Information, S1.
Table 1
Description of Data Sets after Standardization.
LD50 Data Set Actually Has 5924 in Training Set before
Standardization
data set
train
test
total
base train
base test
meta
train
meta test
LD50
5901
1479
7380
4131
590
590
590
IGC50
1434
358
1792
1005
143
143
143
LC50
659
164
823
464
65
65
65
LC50-DM
283
70
353
199
28
28
28
Evaluation
Criteria
We use three evaluation metrics
for reporting the performance of our method.The first metric
used in this paper is the coefficient of determination r2 shown in eq , where y and , respectively, denote the predicted and
actual values and denotes the mean of actual
values. The
coefficient of determination r2 explains
the relationship between the predicted and actual values. It varies
between 0 and 1, and the higher the value of r2, the better the model’s performance.The second metric is the mean absolute
error (mae) shown in eq . The mae is the mean difference
between the prediction y and the actual observation (ŷ).The third
metric is the root-mean-squared error (rmse) shown in eq . The rmse is the square
root of the mean of squared errors. In rmse, the errors are squared,
so the large errors will have higher weights.
Multitask
Weighted Loss Functions
In machine learning,
we typically optimize for a single task or problem in hand by training
a single model on a specific data set. We fine-tune our model and
optimize all its parameters for one specific task to achieve acceptable
performance. By doing so, we laser focus on a single task and might
be ignoring some relevant signals from other tasks that can improve
the overall performance. If we share the lower level representation
between tasks by training a single model for multiple tasks, we might
be able to generalize better on our original task under consideration.
Multitask learning has been successfully applied across various domains
such as natural language processing, computer vision, and drug discovery.[36] In the context of deep learning, multitask learning
is typically done with either hard or soft parameter sharing of hidden
layers.[36] In hard parameter-sharing, generally
hidden layers between multiple tasks are shared, while some task-specific
layers near the output are kept unshared.[36,37] In soft parameter sharing, on the other hand, each task has its
own model with its own parameters. The distance between the parameters
of the model is then regularized in order to encourage the parameters
to be similar.[36,38] In this paper, we only utilize
hard parameter sharing for multitask quantitative toxicity prediction
model because of less chances of overfitting.Although, as noted
before, we use three measures to report our performances; we only
use the third metric, rmse, as the loss function in the training of
the deep neural networks in both of base and meta-learning stages.
As compared to mae, we found rmse as a loss function with slightly
stable performance for the data sets under consideration. Each learning
model has four outputs for four data sets. Hence, in each learning
model (either base or meta), we compute the rmse value for each output
separately and then take a weighted sum of the rmse values to compute
the total rmse value for all outputs of the learning models. The weight
is w when a given input belongs to the task associated
with the output and 1 when not. This means that for a given input
from a given task, the values of the weight of the loss functions
associated with the other three outputs are all 1. In the experiments,
we try various values from {1, 3, 5, 7, 9} for weight w as shown in eqs –7. Once a w value is selected, the same w is used in all based models and the meta model.In
our proposed multitask learning for the data sets under consideration
in this study, there is an “individual task” and other
“helping tasks” considered together in computing the
loss function of each task at a time. The individual task is the one
which is currently under consideration, and helping tasks are there
to help the individual task only. For instance, in eq , LD50 is the main individual
task we are focusing on, while others are related helping tasks. Intuitively,
while computing the loss, the individual task under consideration
should be given more weight as compared to other related helping tasks
because the individual task data carries more information about itself
than its helping tasks. These helping tasks might provide informative
signals to the individual main task under consideration. However,
the signal provided by the helping tasks would ideally be less informative
than the main individual task’s own signals. That is why more
weight is given to the individual task rather than other helping tasks.
For any number of tasks, the procedure for selecting the value of
“w” remains the same. However, the specific
value of “w” may change.
Results
We select weights to be used in our weighted
loss functions. Then,
we evaluate our base features and meta features. These experimental
results are reported from 10-fold internal cross-validation processes.
Finally, we compare our experimental results with those of the state-of-the-art
toxicity prediction methods using an independent test set.
Weight Selection
in Multitask Loss Function
For these
experiments, we use all components of our method: five types of base
features, five base models, and the meta model. We consider the output
of the meta model. Figure shows that our method achieves the best performance with
weight 5 for LD50, IGC50, and LC50 and with weight 9 for LC50-DM. Henceforth, we will use
these weights in our further experiments.
Figure 3
10-Fold cross-validation
performance with various loss weight values
evaluated against meta validation sets. (a) shows the performance
for the LD50 task for a range of loss weight values, (b)
shows the performance for the IGC50 task for a range of
loss weight values, (c) shows the performance for the LC50 task for a range of loss weight values, and (d) shows the performance
for the LC50-DM task for a range of loss weight values.
10-Fold cross-validation
performance with various loss weight values
evaluated against meta validation sets. (a) shows the performance
for the LD50 task for a range of loss weight values, (b)
shows the performance for the IGC50 task for a range of
loss weight values, (c) shows the performance for the LC50 task for a range of loss weight values, and (d) shows the performance
for the LC50-DM task for a range of loss weight values.
Performance Evaluation of Base Features
Table shows the
performance of the
base models using respective base features and using the base validation
set. The table shows the averages of the 10 runs in the 10-fold cross-validation
process. The standard deviation values are in Supporting Information, S3.
Table 2
10-Fold Cross-Validation
Performance
of the Base Models Using Respective Base Features and Using the Base
Validation Seta
r2
mae
rmse
r2
mae
rmse
base features
LD50
IGC50
DESC
0.553
0.465
0.627
0.788
0.355
0.482
MGF
0.544
0.469
0.630
0.764
0.358
0.502
MFP
0.566
0.457
0.621
0.737
0.417
0.549
FPeV
0.361
0.563
0.754
0.432
0.628
0.800
SeV
0.267
0.597
0.800
0.572
0.532
0.703
LC50
LC50-DM
DESC
0.700
0.574
0.814
0.623
0.792
1.053
MGF
0.568
0.674
0.935
0.544
0.875
1.158
MFP
0.617
0.630
0.892
0.456
0.738
1.008
FPeV
0.369
0.909
1.157
0.297
0.937
1.270
SeV
0.567
0.699
0.962
0.431
0.955
1.289
The standard deviation values are
given in Supporting Information, S3.
The standard deviation values are
given in Supporting Information, S3.As shown in Table , DESC performed better in all three metrics
for LC50 and
IGC50. For LD50 and LC50-DM, MFP
obtains the best results in all metrics except in one case. MGF shows
reasonable performance close to DESC and MFP. The possible reason
behind these performances might be the direct biological relevance
of DESC, MFP, and MGF to activity prediction. SeV and fingerprint-embedded
vector (FPeV) showed the lowest performance for all tasks in the three
metrics possibly because of their no direct biological relevance to
the activity prediction. From Supporting Information, S3, we observe that MGF in LD50 and LC50 and
DESC in IGC50 and LC50-DM obtain the overall
most stable performances with least deviation values. The standard
deviation values are large for FPeV and SeV compared to those for
DESC, MGF, and MFP. Comparatively, smaller data sets such as LC50 and LC50-DM show the least stable results in
terms of standard deviation values.
Performance Evaluation
of Meta Features
Our goal in
this study is to effectively aggregate the chemical information extracted
from various base features for quantitative toxicity data sets so
that the regression performance can be improved. We have five types
of base features, DESC, MGF, MFP, SeV, and FPeV. Hence, we have five
based models. We consider all possible subsets of these five types
of base features. This gives us 25 – 1 = 31 possible
subsets. For each subset, we consider only the base features in the
subset and then use only the corresponding base models. However, for
each of the five subsets having just one type of base feature, in
order to have ensembling effects, we use two base models using the
same base features but trained separately. The set of four outputs
of a subset of features is denoted by M, where i is the number of types
of features in the subset, and j is a unique index
within the subsets all having i types of features.
These outputs are used as the meta features for the meta-learning
models. To summarize, we consider 31 possible meta ensembling models.
For convenience, M is also used to denote the corresponding meta ensembling model.
Moreover, Mi denotes the set of meta models all having i types of features. Table shows 10-fold cross-validation performance of the
31 meta ensembling models. For instance, M1 represents single type
of base features used in creating meta features, whereas M2, M3, M4,
and M5 represent any two, three, four, and five different types of
base features with no repetitions. The corresponding standard deviation
values are in Supporting Information, S4.
Table 3
10-Fold Cross-Validation Results for
Meta Features on Meta Validation Seta
LD50
IGC50
LC50
LC50-DM
meta features
base features
r2
mae
rmse
r2
mae
rmse
r2
mae
rmse
r2
mae
rmse
M1-1
DESC, DESC
0.585
0.455
0.613
0.828
0.329
0.434
0.767
0.547
0.737
0.707
0.636
0.841
M1-2
MGF, MGF
0.550
0.470
0.628
0.803
0.345
0.465
0.662
0.650
0.860
0.736
0.670
0.848
M1-3
MFP, MFP
0.578
0.459
0.620
0.764
0.390
0.518
0.673
0.604
0.831
0.704
0.672
0.847
M1-4
FPeV, FPeV
0.365
0.554
0.742
0.431
0.612
0.781
0.491
0.834
1.082
0.507
0.832
1.106
M1-5
SeV, SeV
0.287
0.589
0.791
0.582
0.519
0.679
0.648
0.641
0.883
0.697
0.697
0.904
M2-1
MGF, MFP
0.637
0.419
0.571
0.846
0.303
0.411
0.736
0.542
0.726
0.714
0.624
0.813
M2-2
MGF, DESC
0.629
0.433
0.580
0.851
0.291
0.397
0.772
0.512
0.710
0.738
0.649
0.871
M2-3
MGF, SeV
0.577
0.458
0.617
0.820
0.331
0.444
0.736
0.573
0.760
0.710
0.600
0.804
M2-4
MGF, FPeV
0.564
0.468
0.629
0.837
0.318
0.426
0.712
0.578
0.784
0.719
0.674
0.861
M2-5
MFP, DESC
0.633
0.418
0.566
0.856
0.313
0.409
0.771
0.509
0.678
0.722
0.723
0.910
M2-6
MFP, SeV
0.593
0.455
0.614
0.779
0.376
0.496
0.702
0.584
0.770
0.745
0.613
0.786
M2-7
MFP, FPeV
0.597
0.447
0.593
0.773
0.381
0.501
0.671
0.616
0.828
0.690
0.676
0.868
M2-8
DESC, SeV
0.586
0.451
0.603
0.817
0.331
0.440
0.715
0.570
0.768
0.779
0.589
0.762
M2-9
DESC, FPeV
0.602
0.449
0.597
0.818
0.323
0.435
0.724
0.568
0.756
0.692
0.688
0.878
M2-10
SeV, FPeV
0.414
0.536
0.720
0.641
0.500
0.656
0.623
0.653
0.864
0.645
0.721
0.933
M3-1
MGF, MFP, DESC
0.647
0.412
0.559
0.866
0.289
0.385
0.770
0.499
0.676
0.812
0.566
0.724
M3-2
MGF, MFP, SeV
0.627
0.431
0.576
0.846
0.316
0.415
0.760
0.505
0.690
0.708
0.643
0.821
M3-3
MGF, MFP, FPeV
0.625
0.426
0.580
0.842
0.316
0.421
0.730
0.569
0.784
0.713
0.618
0.807
M3-4
MGF, DESC, SeV
0.614
0.433
0.577
0.853
0.302
0.403
0.756
0.512
0.693
0.778
0.589
0.756
M3-5
MGF, DESC,
FPeV
0.634
0.431
0.578
0.860
0.306
0.407
0.777
0.514
0.684
0.737
0.607
0.813
M3-6
MGF, SeV, FPeV
0.573
0.458
0.613
0.821
0.339
0.450
0.752
0.535
0.720
0.719
0.637
0.798
M3-7
MFP, DESC, SeV
0.626
0.425
0.580
0.847
0.306
0.410
0.754
0.513
0.702
0.757
0.613
0.786
M3-8
MFP, DESC,
FPeV
0.634
0.423
0.575
0.845
0.315
0.426
0.744
0.530
0.707
0.780
0.622
0.807
M3-9
MFP, SeV, FPeV
0.589
0.448
0.606
0.799
0.367
0.482
0.704
0.547
0.768
0.712
0.689
0.894
M3-10
DESC, SeV, FPeV
0.589
0.440
0.588
0.833
0.323
0.433
0.697
0.542
0.755
0.773
0.551
0.702
M4-1
MGF, MFP, DESC, SeV
0.644
0.413
0.551
0.860
0.290
0.384
0.783
0.516
0.701
0.736
0.558
0.754
M4-2
MGF, MFP, DESC,
FPeV
0.637
0.413
0.555
0.859
0.294
0.400
0.786
0.489
0.678
0.783
0.563
0.731
M4-3
MGF, MFP, SeV, FPeV
0.629
0.421
0.569
0.840
0.313
0.422
0.768
0.544
0.725
0.725
0.618
0.810
M4-4
MGF, DESC,
SeV, FPeV
0.619
0.430
0.578
0.851
0.298
0.403
0.761
0.525
0.718
0.741
0.584
0.784
M4-5
MFP, DESC, SeV, FPV
0.632
0.421
0.569
0.850
0.311
0.411
0.756
0.518
0.691
0.756
0.605
0.811
M5-1
MGF, DESC, SeV, FPeV, MFP
0.652
0.411
0.553
0.860
0.302
0.399
0.785
0.506
0.686
0.784
0.615
0.784
The standard deviation
values are
given in Supporting Information, S4.
The standard deviation
values are
given in Supporting Information, S4.Table shows that
meta features in M3, M4, and M5 show overall better performance for
most of the metrics for all four tasks. M3-1 which represents only
those three features which are directly related to a biological activity
prediction achieves better performance in four cases across three
tasks. Similarly, M3-10, M4-1, M4-2, and M5-1 achieve better performance
in two metrics across one or two tasks. Note that M3-1, M4-1, M4-2,
and M5-1 are associated with at least two direct biologically relevant
base features. Surprisingly, SeV and FPeV which lack in direct biological
relevance help the LC50-DM task to improve r2, mae, and rmse from 0.707, 0.636, and 0.841 to 0.773,
0.551, and 0.702, respectively, when used with DESC. Using SeV and
FPeV individually or together results in the worst performance for
all metrics. Overall, meta features used with meta ensemble models
result in more stable performances as shown in Supporting Information, S4. M2-8 shows the most stable performances
across two tasks in the five performance metrics.
Effectiveness
of Meta Models over Base Models
In order
to investigate the effectiveness of meta models M2–M5 compared
to the best meta models in M1 that have only one type of base feature,
we computed the % improvement shown in Figure . An overall improvement can be observed
in r2, mae, and rmse for all four tasks.
As expected, meta model M2-10, which refers to using SeV and FPeV
only, caused an increase in both types of errors and a decrease in
the correlation substantially. For each task separately, Figure a–dhighlights
the best meta model. We select M5-1 with an improvement of 11.44%
for r2, 9.75% for mae, and 9.74% for rmse
on LD50; M3-1 with an improvement of 4.53% for r2, 12.26% for mae, and 11.39% for rmse on IGC50; M4-2 with an improvement of 2.50% for r2, 10.58% for mae, and 8.00% for rmse on LC50; and M3-1 with an improvement of 10.39% for r2, 11.04% for mae, and 13.92% for rmse on LC50-DM.
We developed our final model with these selected meta models for each
task individually as follows for external independent testing.
Figure 4
Performance
of meta models in M2–M5 in terms of % improvement
over the corresponding best meta models in M1 on the meta validation
set. (a) shows % improvement for LD50 task, (b) shows %
improvement for IGC50 task, (c) shows % improvement for
LC50 task, and (d) shows % improvement for LC50-DM task.
Performance
of meta models in M2–M5 in terms of % improvement
over the corresponding best meta models in M1 on the meta validation
set. (a) shows % improvement for LD50 task, (b) shows %
improvement for IGC50 task, (c) shows % improvement for
LC50 task, and (d) shows % improvement for LC50-DM task.
Comparative Landscape Using
the External Independent Test Sets
We compare our method’s
performance against the state-of-the-art
methods in the literature, for example, the models used in the development
of TEST software,[39] TopTox,[10] and Hybrid2d.[12] The
results are shown in Table . As can be seen, from a total of 12 metrics, the proposed
method obtains the best results in 11 of them. Especially in three
of the four data sets, it dominates other algorithms with significant
margin in all three metrics. The detailed results are discussed below
for each task separately.
Table 4
Comparison of Our Method with Other
Methods Using External Independent Test Sets for All Four Tasks
r2
mae
rmse
r2
mae
Rmse
methods
LD50
IGC50
QuantitativeTox
0.687
0.394
0.537
0.861
0.269
0.365
hierarchical[39]
0.578
0.460
0.650
0.719
0.358
0.539
FDA[39]
0.557
0.474
0.657
0.747
0.337
0.489
group contribution[39]
0.682
0.411
0.575
nearest neighbor[39]
0.557
0.477
0.656
0.6
0.451
0.638
TEST consensus[39]
0.626
0.431
0.594
0.764
0.332
0.475
TopTox[10]
0.653
0.421
0.568
0.802
0.305
0.438
Hybrid2D[12]
0.629
0.810
LC50
LC50-DM
QuantitativeTox
0.792
0.479
0.668
0.808
0.520
0.754
hierarchical[39]
0.71
0.574
0.801
0.695
0.757
0.979
single model[39]
0.704
0.605
0.803
0.697
0.772
0.993
FDA[39]
0.626
0.656
0.915
0.565
0.909
1.19
group contribution[39]
0.686
0.578
0.81
0.671
0.62
0.803
nearest neighbor[39]
0.667
0.649
0.876
0.733
0.745
0.975
TEST consensus[39]
0.728
0.545
0.768
0.739
0.727
0.911
TopTox[10]
0.788
0.446
0.677
0.788
0.592
0.805
Hybrid2D[12]
0.678
0.616
LD50: For this data set,
which is the largest for the four, our proposed method dominates other
methods in all three metrics with r2 =
0.687, mae = 0.394, and rmse = 0.537. The results of the TopTox method,
in all three metrics, are better than those of the TEST software methods
but worse than that of our proposed methods. The improvements of our
method over the state-of-the-art TopTox method are 5.206% for r2, 6.413% for mae, and 5.457% for rmse.IGC50: As can
be seen, for
this data set, TEST consensus obtains the highest r2 of 0.764 among all models in TEST software, while the
TopTox model achieved r2 of 0.802. Our
proposed model obtained r2 of 0.861, which
is better than all six models including TopTox. The proposed model
also obtains better mae and rmse values of 0.269 and 0.365, respectively.
The improvements over the state-of-the-art TopTox method are 7.356%
for r2, 11.803% for mae, and 16.666% for
rmse.LC50: In this task, our proposed
method achieves the best results: 0.792 for r2 and 0.668 for rmse. For mae, our proposed method achieves
the second-best result of 0.479, which is after 0.446 of TopTox.LC50-DM: This
is the smallest
data set among all four tasks. In this task, our method obtains the
best results for all three metrics with r2 = 0.808, mae = 0.520, and rmse = 0.754. The improvements over the
state-of-the-art method TopTox are 2.538% for r2, 12.162% for mae, and 6.335% for rmse.
Chemical Space and Prediction Confidence
A diverse
data set covering a broad sample space is a prerequisite for building
predictive models.[40] For all SMILES strings
in the combined train and test sets, we computed the 2048-bit Morgan
fingerprints using RDKIT.[26] We then use
the t-SNE dimensional reduction technique[41] to convert the 2048 dimensional vectors into two t-SNE dimensions,
which are shown in Figure with a perplexity value of 30. From the
charts, train sets are observed to be covering the test sets for LD50, IGC50, and LC50, indicating the possibility
of highly accurate predictions. However, in LC50-DM, the
train set does not cover the test set, which is more diverse. This
indicates that prediction will be more challenging for this data set.
As shown in Figure and Tables –4, this finding is consistent with the comparatively
lower performance of our proposed method on the LC50-DM
data set than on the other three data sets.
Figure 6
2D t-SNE
charts showing the dispersion and coverage of the chemical
sample space of the train sets over the test sets for (a) LD50, (b) IGC50, (c) LC50, and (d) LC50-DM.
Prediction confidence
interval for full training and external test
sets of (a) LD50, (b) IGC50, (c) LC50, and (d) LC50-DM. The units of LC50, LC50-DM, and IGC50 end points are – log 10(T mol/L), where T represents the corresponding
end points. For the LD50 set, the units are – log 10(LD50 mol/kg).2D t-SNE
charts showing the dispersion and coverage of the chemical
sample space of the train sets over the test sets for (a) LD50, (b) IGC50, (c) LC50, and (d) LC50-DM.For each task separately, we show
the prediction confidence for
full training as well as external test sets in Figure . By taking 2 times the mean absolute error
for external test sets of each task, above 85% of the predictions
are covered in the confidence interval. Prediction confidence intervals
for external test sets of LD50, IGC50, LC50, and LC50-DM are 87.93, 87.98, 87.80, and 85.71%,
respectively. The prediction confidence for the external test set
of IGC50 is the highest, whereas it is the lowest for the
external test set of LC50-DM. The possible reason for the
lowest prediction confidence interval for the external test set of
LC50-DM might be its exceptionally diverse chemical space
as shown in Figure d.
Figure 5
Prediction confidence
interval for full training and external test
sets of (a) LD50, (b) IGC50, (c) LC50, and (d) LC50-DM. The units of LC50, LC50-DM, and IGC50 end points are – log 10(T mol/L), where T represents the corresponding
end points. For the LD50 set, the units are – log 10(LD50 mol/kg).
Discussion
We discuss how meta ensembling and multitask
learning bring new
insights into quantitative toxicity end points and can improve overall
performances using off-the-shelf general features.
Impact of Multitask Weighted
Loss Function
The effectiveness
of multitasking which is being proven to be very useful for quantitative
toxicity prediction[10] can be further improved
by using it in meta-ensembling approaches with multitask
weighted loss function optimization. As shown in Figure , it is clear that a smaller
data set such as LC50-DM requires more weight in multitask
loss optimization. Also, tasks with relatively smaller data sets show
more fluctuations with changing the loss weight value. For instance,
LD50 and IGC50 are relatively larger data sets
than LC50 and LC50-DM and thus they are less
affected by the loss weight value.
Impact of Aggregation of
Various Base Features
Representing
molecules using just one type of feature might not help capture all
information about that molecule. For instance, basic molecular graph
representations do not capture the quantum mechanical structure of
molecules or do not necessarily express the information. Similarly,
the models which use molecular graphs as input like graph convolution
will not be able to distinguish between chiral molecules (molecules
having the same graph structure with a mirror image of each other).
In the case of fingerprints as an input, it is also possible that
different molecules may have identical fingerprints and this will
make it difficult for a model to distinguish between them. Features
with direct biological relevance such as DESC, MGF, and MFP in Table prove very useful
individually as opposed to those features which do not have any direct
biological relevance such as FPeV and SeV. The predictive power of
DESC, MGF, and MFP can be further enhanced when aggregated effectively
with FPeV and SeV as shown in Table and Figure . Specifically, in the case of LC50-DM, the r2, mae, and rmse values are substantially improved
by using SeV and FPeV with DESC (M3-10 meta model). This paves a path
toward the idea of using weak features (SeV and FPeV) along with strong
features (DESC) to further enhance the performance due to the strong
features. The chemical information captured by weak features might
not be significant by itself but can play a vital role when aggregated
with the chemical information extracted using strong features. As
shown in Supporting Information, S4, besides
performance improvements, meta ensembling helps stabilize the results
by producing low standard deviation in most metrics. In the case of
smaller data sets such as LC50-DM, meta models, M1, decrease
the mean standard deviation value substantially as compared to the
mean stand deviation value for base models as shown in Figure of Supporting Information, S4.
Conclusions
Quantitative toxicity
measurement has paramount importance in pharmaceuticals.
Toxicity prediction for chemical compounds recently achieved enhanced
performance in terms of accuracy after the introduction of various
deep learning models in this space. Usually, molecules are represented
by a given type of features and a specific machine learning method
is then used to predict the toxicity. The performance of any quantitative
toxicity prediction method depends upon the specific type of features
and the learning model used. This restricts the overall performance
to a single type of feature and a learning model.In this study,
we have introduced a deep-learning-based framework
called QuantitativeTox for predicting quantitative toxicity end points
or data sets such as LD50, IGC50, LC50, and LC50-DM. Our approach has three stages: generating
base features, training base learning models on the base features,
and training a meta-learning model. We use five types of base features
and then use five base learning models on them. The outputs of the
base learning models are used as the meta features for the meta-learning
model. To support multitask training, each model produces four outputs
for four data sets and the loss function uses a weighted sum over
the data sets.We have found that high-level physicochemical,
low-level fingerprints,
SMILES-embedded vectors, and fingerprint-embedded vectors when used
to create meta features for the meta ensemble model enhance the performance
over a wide range of metrics for the quantitative toxicity prediction
tasks. We evaluated our framework against three main regression metrics
using independent test sets and obtained a robust performance compared
to state-of-the-art methods. Our framework can serve as a robust tool
for quantitative toxicity prediction with a better aggregation strategy
for various features along with individual models and multitasking.
It can also be applied to other similar tasks which are related to
each other like in the current study such as LD50, IGC50, LC50, and LC50-DM..