Literature DB >> 34056383

Quantitative Toxicity Prediction via Meta Ensembling of Multitask Deep Learning Models.

Abdul Karim¹, Vahid Riahi¹, Avinash Mishra², M A Hakim Newton³, Abdollah Dehzangi^4,5, Thomas Balle^6,7, Abdul Sattar³.

Abstract

Toxicity prediction using quantitative structure-activity relationship has achieved significant progress in recent years. However, most existing machine learning methods in toxicity prediction utilize only one type of feature representation and one type of neural network, which essentially restricts their performance. Moreover, methods that use more than one type of feature representation struggle with the aggregation of information captured within the features since they use predetermined aggregation formulas. In this paper, we propose a deep learning framework for quantitative toxicity prediction using five individual base deep learning models and their own base feature representations. We then propose to adopt a meta ensemble approach using another separate deep learning model to perform aggregation of the outputs of the individual base deep learning models. We train our deep learning models in a weighted multitask fashion combining four quantitative toxicity data sets of LD50, IGC50, LC50, and LC50-DM and minimizing the root-mean-square errors. Compared to the current state-of-the-art toxicity prediction method TopTox on LD50, IGC50, and LC50-DM, that is, three out of four data sets, our method, respectively, obtains 5.46, 16.67, and 6.34% better root-mean-square errors, 6.41, 11.80, and 12.16% better mean absolute errors, and 5.21, 7.36, and 2.54% better coefficients of determination. We named our method QuantitativeTox, and our implementation is available from the GitHub repository https://github.com/Abdulk084/QuantitativeTox.

Entities: Chemical Disease Gene Species

Year: 2021 PMID： 34056383 PMCID： PMC8154128 DOI： 10.1021/acsomega.1c01247

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Prediction of toxicity levels of chemical compounds is a great challenge. Every year a large number of chemical compounds are produced worldwide, and many of them are suspected to be toxic and are eventually proved so. Toxicity is the extent to which a chemical compound (or mixture of compounds) can cause death, injury, or adverse effects on a living organism and is defined with reference to the quantity of substance administered or absorbed, the way in which the substance is administered (inhalation, ingestion, topical application, or injection) and distributed in time (single or repeated doses), the type and severity of injury, the time needed to produce the injury, the nature of the organism(s) affected, and other relevant conditions.[1] The toxic concentration of a compound is ascertained by experiments measuring end points. Toxicity of a compound can vary with individuals and their age, gender, and body weight. Thus, different toxicity indicators are devised to measure the toxicity over a population. Toxicity estimation, similar to other attributes of chemical compounds, is done using sophisticated experimental techniques on in vivo or in vitro models. However, these techniques are very time consuming and cost intensive. They also raise ethical concerns because of the involvement of animals or tissue harvested from animals. To address these issues, in silico methods (computer-aided methods) have attracted much attention due to their lower cost and better time efficiency. There exist many in silico methods, but the quantitative structure activity–property relationship (QSAR/QSPR) method is one of the most successful ones. The main rationale behind the QSAR method is that chemical molecules that are similar in structure should have similar activities and properties. Therefore, studying the relationship between chemical structures and biological activities/properties of existing chemicals with known activities enables the construction of models that can predict activities of novel chemicals using multivariate statistical methods or machine learning algorithms.[2] QSAR modeling using deep learning techniques has become very popular.[3] Many of these methods use two-dimensional (2D) features calculated from the one-dimensional (1D) representation of the molecules called SMILES strings. SMILES is a language used in describing the chemical structure of a molecule as a string of characters.[4] There is a special grammar for SMILES to represent atoms, types, and chemical bonds between them. SMILES strings are used for calculating various types of numerical features (e.g., physicochemical descriptors) and molecular graphs by using different featurization methods.[5,6] These numerical features can then be used by traditional machine learning approaches such as K-nearest neighbors (KNN), support vector machines (SVM), random forest (RF), and fully connected neural networks (FCNN) to predict activities or properties of a chemical compound.[7] Besides numerical features, SMILES strings can also be used to generate molecular graphs or images, which then can be used in various types of convolutional neural networks (CNN) to predict molecular activities.[8] Using CNN for molecular graphs or images needs relatively less domain expertise. It should be noted that SMILES strings can also be transformed into a vector representation or into their respective fingerprints. Fingerprints are bit strings composed of 0s and 1s and can be used in recurrent neural networks (RNN) for molecular activity/property prediction.[9] Recently, in the area of toxicity prediction, a specialized type of features called element-specific topological descriptors (ESTDs) have been used in deep neural networks and in consensus models by the TopTox[10] to predict toxicity levels of chemical compounds. A recent software named AdmetSAR uses molecular fingerprints (MFP) to predict toxicity using the RF, SVM, and KNN models.[11] Another method, named Hybrid2D, uses joint optimization of shallow neural networks and decision trees on 2D features to predict toxicity measurement levels.[12] The performance of all these quantitative prediction methods is restricted by the specific type of features or models used in prediction. Other than that, various ensemble learning methods have also been used in bioinformatics and QSAR prediction such as DeepHIT,[13] FS-MLKNN,[14] and piRNA Predictor.[15] Ensemble learning is a sophisticated technique of combining features, which recently is attracting more and more interest in the bioinformatics field.[14] DeepHIT utilizes a reasonably diverse feature set, but it still suffers from the lack of an effective way for combining the outputs of individual models to obtain a robust performance over a range of metrics. Moreover, DeepHit is optimized for cardiotoxicity classification to enhance the sensitivity of the model only. FS-MLKNN and piRNA Predictor use a weighted scoring ensemble strategy to predict the side-effects of various drugs and transposon-derived piRNAs, respectively. In this work, for quantitative toxicity prediction, we hypothesized that an effective aggregation of various chemical information captured within various feature representations extracted from SMILES strings can improve the prediction accuracy. For this purpose, we propose a three-stage deep learning framework: A featurization stage first generates a number of base features. A base learning stage then trains a number of deep learning models, one for each base feature. A meta-learning stage uses the outputs of the base learning stage as input meta features and trains a separate deep learning model for meta ensembling and producing the final output. The five types of base features generated in the featurization stage are 2D and three-dimensional (3D) descriptors, molecular graph features (MGFs), extended-connectivity fingerprints (ECFPs), SMILES vocabulary-based embedded vectors, and fingerprint-based embedded vectors. The base learning stage comprises five deep learning models such as 2 × deep neural networks, 1 × graph convolutional neural network (GCNN), and 2 × 1D CNN. Each of these deep learning models essentially is for one of the base features. The meta-learning stage comprises a fully connected deep neural network. We trained all of our deep learning models in a weighted multitask fashion combining four quantitative toxicity data sets of LD50, LC50, IGC50, and LC50-DM and minimizing the root mean square error. Compared to the current state-of-the-art toxicity prediction method TopTox, on LD50, IGC50, and LC50-DM, that is three out of four data sets, our method, respectively, obtained 5.46, 16.67, and 6.34% better root means square errors, 6.41, 11.80, and 12.16% better mean absolute errors, and 5.21, 7.36, and 2.54% better coefficients of determination. We named our method QuantitativeTox and our implementation is available from the GitHub repository https://github.com/Abdulk084/QuantitativeTox.

Materials and Methods

We describe the four quantitative data sets, the evaluation criteria, and the weighted loss function used in this work. As shown in Figure , our deep learning framework has three stages: featurization, base learning, and meta-learning. The featurization stage is to generate base features which are used in training base learning models. The output of the base learning models is then used as meta features for the meta-learning model to produce the final predictions. We describe the three stages in more detail.

Figure 2

End-to-end flow diagram of all stages of the proposed framework.

(a) FCNN for 995 2D + 3D descriptors as base features. (b) CNN for node vectors of size 51 × 65 and adjacency vectors of size 51 × 51 as base features. (c) FCNN for 1024 ECFP and 881 pubChem fingerprints as base features. (d) 1D CNN for each of the SMILES embedded vectors and fingerprint-embedded vectors as base features. (e) Meta ensemble FCNN for meta features. End-to-end flow diagram of all stages of the proposed framework.

Featurization Stage

The featurization stage of our framework consists of various types of featurizers. Each featurizer takes SMILES strings as input and produces fixed-length base features as output. Figure shows the five featurizers and their output base features.

2D and 3D Descriptors (DESC)

A total of 995 high-level features such as 2D and 3D physicochemical descriptors are computed using Mordred.[16] The feature names are described in Table S2 of DeepHIT in the Supporting Information.[13] These features are numerical in nature and describe the physical and chemical properties of molecules.[17] 2D descriptors represent information related to size, shape, distribution of electrons, octanol–water distribution coefficients (log P) measuring lipophilicity, nAromAtom denoting the number of aromatic atoms, nHeavyAtom denoting the number of non-hydrogen atoms, and nBondsT denoting the number of triple bonds. 3D descriptors relate to the 3D conformation of the molecules and include the moment of inertia along the Y axis (MOMIY).[17] The value of each descriptor is normalized between 0 and 1.

Molecular Graph Features

Topological information of molecules can be intuitively and concisely expressed via molecular graph features. In this featurizer,[13,18] molecular graph features such as node vectors and adjacency matrix are computed. Node vectors represent atoms in the SMILE strings. The adjacency matrix represents the bonds between atoms. In this study, we extract [51 × 65] node vectors and a [51 × 51] adjacency matrix. Here, 51 is the maximum number of atoms and 65 is the length of the one hot-encoded feature vector computed from atom descriptors. The details of these are in Table S3 of DeepHIT in the Supporting Information.[13]

Molecular Fingerprint

The third featurizer deals with fingerprints, where structural features are represented either by bits in a bit string or by counts in a count vector.[19,20] 1024 extended-connectivity fingerprints with a maximum diameter parameter of 2 (ECFP2) fingerprints and 881 pubChem fingerprints are computed using the Python package PyBioMed.[13,21] ECFPs are also referred to as circular fingerprints and are specifically designed for structure–activity relationship modeling,[22] whereas pubChem fingerprints are mainly designed for similarity neighboring and similarity searching.[23]

SMILES Strings Embedded Vectors

We compute low-level features SMILES strings embedded vectors (SeV).[9,24] These features do not directly describe any biological attribute of the molecules but have been proven to have a reasonable predictive power in various QSAR tasks. In the SMILES vectorizer, we create a vocabulary based on the valid SMILES tokens (procedure described in Supporting Information, S2). A total of 64 unique tokens are determined based on the training data. Each SMILES string is converted into a one-hot encoded vector based on the SMILES vocabulary.

Fingerprint-Embedded Vectors

We also compute fingerprint-based embedded vectors (FPeV).[25] In the fingerprint vectorizer, SMILES strings are converted into 1024-bit Morgan (or circular) fingerprints with a radius of 2 via RDKIT.[26] As per an existing technique,[25] we extract fingerprint indices, which are marked 1 in the fingerprints generated. Thus, we obtain a vector of length 93 where the vector consists of integers representing the presence of specific substructures in a molecule. The procedure for fingerprint-embedded vectors is described as FP2VEC.[25]

Base Learning Stage

The base learning stage consists of five base deep neural network models, which are trained on respective base features from the featurization stage. All of the base models are trained at a learning rate of 10 × 10–4 with an Adam optimizer and 100 epochs with a batch size of 32. The selection of parameters, hyper-parameters, and network architecture of base models is inspired by previous published research in this area.[9,12,13,24,25,27,28] Each of these base models produces four regression values for four tasks as output; only the output corresponding to a given task is counted for a given input. The base models are shown in Figure a–d and are described below. The Keras deep learning framework and Spektral package are used in developing the base models.[29,30]

Figure 1

Fully Connected Neural Network for DESC

As shown in Figure a, a fully connected deep neural network with four hidden layers was trained and validated on 995 2D + 3D physicochemical descriptors. The input layer consists of 995 nodes for the 995 physicochemical descriptors. The output layer has four units, one for each of the four tasks. All layers in the fully connected neural network for DESC (FCNND) are densely connected and receive input from all units in the previous layer. The number of units in each hidden layer decreases gradually, and a ReLU activation[31,32] is applied at the end of each layer except the output layer. Various regularization parameters such as Kernel regularizer that applies penalties to the Kernel (main units in the layer) and bias regularizer that applies penalties to the bias units to reduce overfitting during optimization[32,33] are used. We also apply a drop-out rate of 0.5 to the middle layers.[34]

Graph Convolutional Neural Network for MGF

As shown in Figure b, a GCNN was trained using the molecular graph features. GCNN consists of two graph convolution layers, one global attention pool layer and a dense layer before the output.[35] Each of the graph convolutional layers is initiated with 64 channels with a Kernel regularization value of 0.01 and ReLU activation. The number of channels in the global attention pool layer is made equal to 1024, the number of units in the following dense layer.

Fully Connected Neural Network for MFP

As shown in Figure c, a FCNN is used with fingerprints as the base feature. Unlike FCNND, a fully connected neural network for MFP (FCNNF) uses a much smaller number of units in each layer. Except for the number of units, other parameters are kept the same as in FCNND. The number of input nodes in the input layer is kept at 1905 to match the sum of 1024 ECFP fingerprints and 881 pubChem fingerprints.

Convolution 1D Neural Network for SeV and for FPeV

As shown in Figure d, a variant of a convolution 1D neural network (C1D) is used for each of the SMILES-embedded vectors and fingerprint-embedded vectors as base features. The only difference between the two C1Ds is in the number of input-layer nodes: 97 for SMILES-embedded vectors and 93 for fingerprint-embedded vectors. Input vectors are converted to a trainable embedded matrix of size [97 or 93 × 200], which was then fed into a series of three 1D convolution layers. Each of these 1D convolution layers used ReLU activation, 192 filters with a Kernel size of 10, 5, and 3, respectively. Two densely connected layers are also used before the output layer.

Meta-learning Stage

As mentioned before, each base model produces four outputs for four data sets. The outputs of the base models are used as meta features for the meta-learning model. As shown as FCNNM in Figure e, the meta-learning model is an FCNN with an input layer, an output layer, and two hidden layers. It is trained at a learning rate of 10 × 10–3 with an Adam optimizer and 300 epochs with a batch size of 32. The meta-learning model acts as an ensembling method for the whole framework. Our hypothesis is that for quantitative toxicity prediction, meta ensembling will be better able to aggregate the output of individual base models than the typical average ensembling approaches.

Data Sets Used

We use end points for four quantitative toxicity data sets (also called tasks). These data sets are LD50, IGC50, LC50, and LC50-DM.[10] These endpoint measures have been used in toxicology for estimating the toxicity behavior of a given chemical compound on a given population of a given organism. These measures depend on the concentration of the compound as well as the duration of exposure of a given organism to the compound. LC50 and LD50 are the compound concentrations that kill half the members of the tested animal population. LC50 records the toxicity of a given compound on fathead minnow, a species of temperate freshwater fish after 96 h exposure. LC50-DM records the concentration of a compound in water in milligrams per liter causing 50% population of Daphnia maga to die after 48 h. LD50 data set has the lethal dose data for killing 50% rat population when a given compound is administered orally, given that oral administration could cause less toxicity than an intravenous route. IGC50 data set shows the concentration of a chemical compound to arrest the growth of Tetrahymena pyriformis when exposed for 40 h. The units of LC50, LC50-DM, and IGC50 end points are – log 10(T mol/L), where T represents corresponding end points. For the LD50 set, the units are – log 10(LD50 mol/kg). Original data is available at http://cfpub.epa.gov/ecotox/, http://cfpub.epa.gov/ecotox/, http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp, and http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp. We obtained preprocessed train and test sets (as pairs of SMILES strings and toxicity measures) from TopTox.[10] As shown in Table , these data sets have different sizes ranging from hundreds to thousands of compounds. In the LD50 data set, some molecules were removed as part of standardization using RDKIT (http://www.rdkit.org/) and MolVS (https://molvs.readthedocs.io/en/latest/). The train set from each of the four tasks is randomly split into four types of subsets: 70% for the base train set, 10% for the base validation set, 10% for the meta train set, and 10% for the meta validation set. Next, for each type of subset, the corresponding subsets for the four tasks are concatenated to obtain a combined set of that type. The test sets from the four tasks are also concatenated to obtain a combined test set. These combined train and test sets are available from the GitHub repository https://github.com/Abdulk084/QuantitativeTox/blob/master/training_multitask.tar.xz. The data split procedure is in Supporting Information, S1.

Table 1

Description of Data Sets after Standardization. LD50 Data Set Actually Has 5924 in Training Set before Standardization

data set	train	test	total	base train	base test	meta train	meta test
LD₅₀	5901	1479	7380	4131	590	590	590
IGC₅₀	1434	358	1792	1005	143	143	143
LC₅₀	659	164	823	464	65	65	65
LC₅₀-DM	283	70	353	199	28	28	28

Evaluation Criteria

We use three evaluation metrics for reporting the performance of our method. The first metric used in this paper is the coefficient of determination r2 shown in eq , where y and , respectively, denote the predicted and actual values and denotes the mean of actual values. The coefficient of determination r2 explains the relationship between the predicted and actual values. It varies between 0 and 1, and the higher the value of r2, the better the model’s performance. The second metric is the mean absolute error (mae) shown in eq . The mae is the mean difference between the prediction y and the actual observation (ŷ). The third metric is the root-mean-squared error (rmse) shown in eq . The rmse is the square root of the mean of squared errors. In rmse, the errors are squared, so the large errors will have higher weights.

Multitask Weighted Loss Functions

In machine learning, we typically optimize for a single task or problem in hand by training a single model on a specific data set. We fine-tune our model and optimize all its parameters for one specific task to achieve acceptable performance. By doing so, we laser focus on a single task and might be ignoring some relevant signals from other tasks that can improve the overall performance. If we share the lower level representation between tasks by training a single model for multiple tasks, we might be able to generalize better on our original task under consideration. Multitask learning has been successfully applied across various domains such as natural language processing, computer vision, and drug discovery.[36] In the context of deep learning, multitask learning is typically done with either hard or soft parameter sharing of hidden layers.[36] In hard parameter-sharing, generally hidden layers between multiple tasks are shared, while some task-specific layers near the output are kept unshared.[36,37] In soft parameter sharing, on the other hand, each task has its own model with its own parameters. The distance between the parameters of the model is then regularized in order to encourage the parameters to be similar.[36,38] In this paper, we only utilize hard parameter sharing for multitask quantitative toxicity prediction model because of less chances of overfitting. Although, as noted before, we use three measures to report our performances; we only use the third metric, rmse, as the loss function in the training of the deep neural networks in both of base and meta-learning stages. As compared to mae, we found rmse as a loss function with slightly stable performance for the data sets under consideration. Each learning model has four outputs for four data sets. Hence, in each learning model (either base or meta), we compute the rmse value for each output separately and then take a weighted sum of the rmse values to compute the total rmse value for all outputs of the learning models. The weight is w when a given input belongs to the task associated with the output and 1 when not. This means that for a given input from a given task, the values of the weight of the loss functions associated with the other three outputs are all 1. In the experiments, we try various values from {1, 3, 5, 7, 9} for weight w as shown in eqs –7. Once a w value is selected, the same w is used in all based models and the meta model. In our proposed multitask learning for the data sets under consideration in this study, there is an “individual task” and other “helping tasks” considered together in computing the loss function of each task at a time. The individual task is the one which is currently under consideration, and helping tasks are there to help the individual task only. For instance, in eq , LD50 is the main individual task we are focusing on, while others are related helping tasks. Intuitively, while computing the loss, the individual task under consideration should be given more weight as compared to other related helping tasks because the individual task data carries more information about itself than its helping tasks. These helping tasks might provide informative signals to the individual main task under consideration. However, the signal provided by the helping tasks would ideally be less informative than the main individual task’s own signals. That is why more weight is given to the individual task rather than other helping tasks. For any number of tasks, the procedure for selecting the value of “w” remains the same. However, the specific value of “w” may change.

Results

We select weights to be used in our weighted loss functions. Then, we evaluate our base features and meta features. These experimental results are reported from 10-fold internal cross-validation processes. Finally, we compare our experimental results with those of the state-of-the-art toxicity prediction methods using an independent test set.

Weight Selection in Multitask Loss Function

For these experiments, we use all components of our method: five types of base features, five base models, and the meta model. We consider the output of the meta model. Figure shows that our method achieves the best performance with weight 5 for LD50, IGC50, and LC50 and with weight 9 for LC50-DM. Henceforth, we will use these weights in our further experiments.

Figure 3

10-Fold cross-validation performance with various loss weight values evaluated against meta validation sets. (a) shows the performance for the LD50 task for a range of loss weight values, (b) shows the performance for the IGC50 task for a range of loss weight values, (c) shows the performance for the LC50 task for a range of loss weight values, and (d) shows the performance for the LC50-DM task for a range of loss weight values.

Performance Evaluation of Base Features

Table shows the performance of the base models using respective base features and using the base validation set. The table shows the averages of the 10 runs in the 10-fold cross-validation process. The standard deviation values are in Supporting Information, S3.

Table 2

10-Fold Cross-Validation Performance of the Base Models Using Respective Base Features and Using the Base Validation Seta

	r²	mae	rmse	r²	mae	rmse
base features	LD₅₀			IGC₅₀
DESC	0.553	0.465	0.627	0.788	0.355	0.482
MGF	0.544	0.469	0.630	0.764	0.358	0.502
MFP	0.566	0.457	0.621	0.737	0.417	0.549
FPeV	0.361	0.563	0.754	0.432	0.628	0.800
SeV	0.267	0.597	0.800	0.572	0.532	0.703
	LC₅₀			LC₅₀-DM
DESC	0.700	0.574	0.814	0.623	0.792	1.053
MGF	0.568	0.674	0.935	0.544	0.875	1.158
MFP	0.617	0.630	0.892	0.456	0.738	1.008
FPeV	0.369	0.909	1.157	0.297	0.937	1.270
SeV	0.567	0.699	0.962	0.431	0.955	1.289

The standard deviation values are given in Supporting Information, S3.

The standard deviation values are given in Supporting Information, S3. As shown in Table , DESC performed better in all three metrics for LC50 and IGC50. For LD50 and LC50-DM, MFP obtains the best results in all metrics except in one case. MGF shows reasonable performance close to DESC and MFP. The possible reason behind these performances might be the direct biological relevance of DESC, MFP, and MGF to activity prediction. SeV and fingerprint-embedded vector (FPeV) showed the lowest performance for all tasks in the three metrics possibly because of their no direct biological relevance to the activity prediction. From Supporting Information, S3, we observe that MGF in LD50 and LC50 and DESC in IGC50 and LC50-DM obtain the overall most stable performances with least deviation values. The standard deviation values are large for FPeV and SeV compared to those for DESC, MGF, and MFP. Comparatively, smaller data sets such as LC50 and LC50-DM show the least stable results in terms of standard deviation values.

Performance Evaluation of Meta Features

Our goal in this study is to effectively aggregate the chemical information extracted from various base features for quantitative toxicity data sets so that the regression performance can be improved. We have five types of base features, DESC, MGF, MFP, SeV, and FPeV. Hence, we have five based models. We consider all possible subsets of these five types of base features. This gives us 25 – 1 = 31 possible subsets. For each subset, we consider only the base features in the subset and then use only the corresponding base models. However, for each of the five subsets having just one type of base feature, in order to have ensembling effects, we use two base models using the same base features but trained separately. The set of four outputs of a subset of features is denoted by M, where i is the number of types of features in the subset, and j is a unique index within the subsets all having i types of features. These outputs are used as the meta features for the meta-learning models. To summarize, we consider 31 possible meta ensembling models. For convenience, M is also used to denote the corresponding meta ensembling model. Moreover, Mi denotes the set of meta models all having i types of features. Table shows 10-fold cross-validation performance of the 31 meta ensembling models. For instance, M1 represents single type of base features used in creating meta features, whereas M2, M3, M4, and M5 represent any two, three, four, and five different types of base features with no repetitions. The corresponding standard deviation values are in Supporting Information, S4.

Table 3

10-Fold Cross-Validation Results for Meta Features on Meta Validation Seta

		LD₅₀			IGC₅₀			LC₅₀			LC₅₀-DM
meta features	base features	r²	mae	rmse	r²	mae	rmse	r²	mae	rmse	r²	mae	rmse
M1-1	DESC, DESC	0.585	0.455	0.613	0.828	0.329	0.434	0.767	0.547	0.737	0.707	0.636	0.841
M1-2	MGF, MGF	0.550	0.470	0.628	0.803	0.345	0.465	0.662	0.650	0.860	0.736	0.670	0.848
M1-3	MFP, MFP	0.578	0.459	0.620	0.764	0.390	0.518	0.673	0.604	0.831	0.704	0.672	0.847
M1-4	FPeV, FPeV	0.365	0.554	0.742	0.431	0.612	0.781	0.491	0.834	1.082	0.507	0.832	1.106
M1-5	SeV, SeV	0.287	0.589	0.791	0.582	0.519	0.679	0.648	0.641	0.883	0.697	0.697	0.904
M2-1	MGF, MFP	0.637	0.419	0.571	0.846	0.303	0.411	0.736	0.542	0.726	0.714	0.624	0.813
M2-2	MGF, DESC	0.629	0.433	0.580	0.851	0.291	0.397	0.772	0.512	0.710	0.738	0.649	0.871
M2-3	MGF, SeV	0.577	0.458	0.617	0.820	0.331	0.444	0.736	0.573	0.760	0.710	0.600	0.804
M2-4	MGF, FPeV	0.564	0.468	0.629	0.837	0.318	0.426	0.712	0.578	0.784	0.719	0.674	0.861
M2-5	MFP, DESC	0.633	0.418	0.566	0.856	0.313	0.409	0.771	0.509	0.678	0.722	0.723	0.910
M2-6	MFP, SeV	0.593	0.455	0.614	0.779	0.376	0.496	0.702	0.584	0.770	0.745	0.613	0.786
M2-7	MFP, FPeV	0.597	0.447	0.593	0.773	0.381	0.501	0.671	0.616	0.828	0.690	0.676	0.868
M2-8	DESC, SeV	0.586	0.451	0.603	0.817	0.331	0.440	0.715	0.570	0.768	0.779	0.589	0.762
M2-9	DESC, FPeV	0.602	0.449	0.597	0.818	0.323	0.435	0.724	0.568	0.756	0.692	0.688	0.878
M2-10	SeV, FPeV	0.414	0.536	0.720	0.641	0.500	0.656	0.623	0.653	0.864	0.645	0.721	0.933
M3-1	MGF, MFP, DESC	0.647	0.412	0.559	0.866	0.289	0.385	0.770	0.499	0.676	0.812	0.566	0.724
M3-2	MGF, MFP, SeV	0.627	0.431	0.576	0.846	0.316	0.415	0.760	0.505	0.690	0.708	0.643	0.821
M3-3	MGF, MFP, FPeV	0.625	0.426	0.580	0.842	0.316	0.421	0.730	0.569	0.784	0.713	0.618	0.807
M3-4	MGF, DESC, SeV	0.614	0.433	0.577	0.853	0.302	0.403	0.756	0.512	0.693	0.778	0.589	0.756
M3-5	MGF, DESC, FPeV	0.634	0.431	0.578	0.860	0.306	0.407	0.777	0.514	0.684	0.737	0.607	0.813
M3-6	MGF, SeV, FPeV	0.573	0.458	0.613	0.821	0.339	0.450	0.752	0.535	0.720	0.719	0.637	0.798
M3-7	MFP, DESC, SeV	0.626	0.425	0.580	0.847	0.306	0.410	0.754	0.513	0.702	0.757	0.613	0.786
M3-8	MFP, DESC, FPeV	0.634	0.423	0.575	0.845	0.315	0.426	0.744	0.530	0.707	0.780	0.622	0.807
M3-9	MFP, SeV, FPeV	0.589	0.448	0.606	0.799	0.367	0.482	0.704	0.547	0.768	0.712	0.689	0.894
M3-10	DESC, SeV, FPeV	0.589	0.440	0.588	0.833	0.323	0.433	0.697	0.542	0.755	0.773	0.551	0.702
M4-1	MGF, MFP, DESC, SeV	0.644	0.413	0.551	0.860	0.290	0.384	0.783	0.516	0.701	0.736	0.558	0.754
M4-2	MGF, MFP, DESC, FPeV	0.637	0.413	0.555	0.859	0.294	0.400	0.786	0.489	0.678	0.783	0.563	0.731
M4-3	MGF, MFP, SeV, FPeV	0.629	0.421	0.569	0.840	0.313	0.422	0.768	0.544	0.725	0.725	0.618	0.810
M4-4	MGF, DESC, SeV, FPeV	0.619	0.430	0.578	0.851	0.298	0.403	0.761	0.525	0.718	0.741	0.584	0.784
M4-5	MFP, DESC, SeV, FPV	0.632	0.421	0.569	0.850	0.311	0.411	0.756	0.518	0.691	0.756	0.605	0.811
M5-1	MGF, DESC, SeV, FPeV, MFP	0.652	0.411	0.553	0.860	0.302	0.399	0.785	0.506	0.686	0.784	0.615	0.784

The standard deviation values are given in Supporting Information, S4.

The standard deviation values are given in Supporting Information, S4. Table shows that meta features in M3, M4, and M5 show overall better performance for most of the metrics for all four tasks. M3-1 which represents only those three features which are directly related to a biological activity prediction achieves better performance in four cases across three tasks. Similarly, M3-10, M4-1, M4-2, and M5-1 achieve better performance in two metrics across one or two tasks. Note that M3-1, M4-1, M4-2, and M5-1 are associated with at least two direct biologically relevant base features. Surprisingly, SeV and FPeV which lack in direct biological relevance help the LC50-DM task to improve r2, mae, and rmse from 0.707, 0.636, and 0.841 to 0.773, 0.551, and 0.702, respectively, when used with DESC. Using SeV and FPeV individually or together results in the worst performance for all metrics. Overall, meta features used with meta ensemble models result in more stable performances as shown in Supporting Information, S4. M2-8 shows the most stable performances across two tasks in the five performance metrics.

Effectiveness of Meta Models over Base Models

In order to investigate the effectiveness of meta models M2–M5 compared to the best meta models in M1 that have only one type of base feature, we computed the % improvement shown in Figure . An overall improvement can be observed in r2, mae, and rmse for all four tasks. As expected, meta model M2-10, which refers to using SeV and FPeV only, caused an increase in both types of errors and a decrease in the correlation substantially. For each task separately, Figure a–dhighlights the best meta model. We select M5-1 with an improvement of 11.44% for r2, 9.75% for mae, and 9.74% for rmse on LD50; M3-1 with an improvement of 4.53% for r2, 12.26% for mae, and 11.39% for rmse on IGC50; M4-2 with an improvement of 2.50% for r2, 10.58% for mae, and 8.00% for rmse on LC50; and M3-1 with an improvement of 10.39% for r2, 11.04% for mae, and 13.92% for rmse on LC50-DM. We developed our final model with these selected meta models for each task individually as follows for external independent testing.

Figure 4

Performance of meta models in M2–M5 in terms of % improvement over the corresponding best meta models in M1 on the meta validation set. (a) shows % improvement for LD50 task, (b) shows % improvement for IGC50 task, (c) shows % improvement for LC50 task, and (d) shows % improvement for LC50-DM task.

Comparative Landscape Using the External Independent Test Sets

We compare our method’s performance against the state-of-the-art methods in the literature, for example, the models used in the development of TEST software,[39] TopTox,[10] and Hybrid2d.[12] The results are shown in Table . As can be seen, from a total of 12 metrics, the proposed method obtains the best results in 11 of them. Especially in three of the four data sets, it dominates other algorithms with significant margin in all three metrics. The detailed results are discussed below for each task separately.

Table 4

Comparison of Our Method with Other Methods Using External Independent Test Sets for All Four Tasks

	r²	mae	rmse	r²	mae	Rmse
methods	LD₅₀			IGC₅₀
QuantitativeTox	0.687	0.394	0.537	0.861	0.269	0.365
hierarchical[39]	0.578	0.460	0.650	0.719	0.358	0.539
FDA[39]	0.557	0.474	0.657	0.747	0.337	0.489
group contribution[39]				0.682	0.411	0.575
nearest neighbor[39]	0.557	0.477	0.656	0.6	0.451	0.638
TEST consensus[39]	0.626	0.431	0.594	0.764	0.332	0.475
TopTox[10]	0.653	0.421	0.568	0.802	0.305	0.438
Hybrid2D[12]	0.629			0.810
	LC₅₀			LC₅₀-DM
QuantitativeTox	0.792	0.479	0.668	0.808	0.520	0.754
hierarchical[39]	0.71	0.574	0.801	0.695	0.757	0.979
single model[39]	0.704	0.605	0.803	0.697	0.772	0.993
FDA[39]	0.626	0.656	0.915	0.565	0.909	1.19
group contribution[39]	0.686	0.578	0.81	0.671	0.62	0.803
nearest neighbor[39]	0.667	0.649	0.876	0.733	0.745	0.975
TEST consensus[39]	0.728	0.545	0.768	0.739	0.727	0.911
TopTox[10]	0.788	0.446	0.677	0.788	0.592	0.805
Hybrid2D[12]	0.678			0.616

LD50: For this data set, which is the largest for the four, our proposed method dominates other methods in all three metrics with r2 = 0.687, mae = 0.394, and rmse = 0.537. The results of the TopTox method, in all three metrics, are better than those of the TEST software methods but worse than that of our proposed methods. The improvements of our method over the state-of-the-art TopTox method are 5.206% for r2, 6.413% for mae, and 5.457% for rmse. IGC50: As can be seen, for this data set, TEST consensus obtains the highest r2 of 0.764 among all models in TEST software, while the TopTox model achieved r2 of 0.802. Our proposed model obtained r2 of 0.861, which is better than all six models including TopTox. The proposed model also obtains better mae and rmse values of 0.269 and 0.365, respectively. The improvements over the state-of-the-art TopTox method are 7.356% for r2, 11.803% for mae, and 16.666% for rmse. LC50: In this task, our proposed method achieves the best results: 0.792 for r2 and 0.668 for rmse. For mae, our proposed method achieves the second-best result of 0.479, which is after 0.446 of TopTox. LC50-DM: This is the smallest data set among all four tasks. In this task, our method obtains the best results for all three metrics with r2 = 0.808, mae = 0.520, and rmse = 0.754. The improvements over the state-of-the-art method TopTox are 2.538% for r2, 12.162% for mae, and 6.335% for rmse.

Chemical Space and Prediction Confidence

A diverse data set covering a broad sample space is a prerequisite for building predictive models.[40] For all SMILES strings in the combined train and test sets, we computed the 2048-bit Morgan fingerprints using RDKIT.[26] We then use the t-SNE dimensional reduction technique[41] to convert the 2048 dimensional vectors into two t-SNE dimensions, which are shown in Figure with a perplexity value of 30. From the charts, train sets are observed to be covering the test sets for LD50, IGC50, and LC50, indicating the possibility of highly accurate predictions. However, in LC50-DM, the train set does not cover the test set, which is more diverse. This indicates that prediction will be more challenging for this data set. As shown in Figure and Tables –4, this finding is consistent with the comparatively lower performance of our proposed method on the LC50-DM data set than on the other three data sets.

Figure 6

2D t-SNE charts showing the dispersion and coverage of the chemical sample space of the train sets over the test sets for (a) LD50, (b) IGC50, (c) LC50, and (d) LC50-DM.

Prediction confidence interval for full training and external test sets of (a) LD50, (b) IGC50, (c) LC50, and (d) LC50-DM. The units of LC50, LC50-DM, and IGC50 end points are – log 10(T mol/L), where T represents the corresponding end points. For the LD50 set, the units are – log 10(LD50 mol/kg). 2D t-SNE charts showing the dispersion and coverage of the chemical sample space of the train sets over the test sets for (a) LD50, (b) IGC50, (c) LC50, and (d) LC50-DM. For each task separately, we show the prediction confidence for full training as well as external test sets in Figure . By taking 2 times the mean absolute error for external test sets of each task, above 85% of the predictions are covered in the confidence interval. Prediction confidence intervals for external test sets of LD50, IGC50, LC50, and LC50-DM are 87.93, 87.98, 87.80, and 85.71%, respectively. The prediction confidence for the external test set of IGC50 is the highest, whereas it is the lowest for the external test set of LC50-DM. The possible reason for the lowest prediction confidence interval for the external test set of LC50-DM might be its exceptionally diverse chemical space as shown in Figure d.

Figure 5

Discussion

We discuss how meta ensembling and multitask learning bring new insights into quantitative toxicity end points and can improve overall performances using off-the-shelf general features.

Impact of Multitask Weighted Loss Function

The effectiveness of multitasking which is being proven to be very useful for quantitative toxicity prediction[10] can be further improved by using it in meta-ensembling approaches with multitask weighted loss function optimization. As shown in Figure , it is clear that a smaller data set such as LC50-DM requires more weight in multitask loss optimization. Also, tasks with relatively smaller data sets show more fluctuations with changing the loss weight value. For instance, LD50 and IGC50 are relatively larger data sets than LC50 and LC50-DM and thus they are less affected by the loss weight value.

Impact of Aggregation of Various Base Features

Representing molecules using just one type of feature might not help capture all information about that molecule. For instance, basic molecular graph representations do not capture the quantum mechanical structure of molecules or do not necessarily express the information. Similarly, the models which use molecular graphs as input like graph convolution will not be able to distinguish between chiral molecules (molecules having the same graph structure with a mirror image of each other). In the case of fingerprints as an input, it is also possible that different molecules may have identical fingerprints and this will make it difficult for a model to distinguish between them. Features with direct biological relevance such as DESC, MGF, and MFP in Table prove very useful individually as opposed to those features which do not have any direct biological relevance such as FPeV and SeV. The predictive power of DESC, MGF, and MFP can be further enhanced when aggregated effectively with FPeV and SeV as shown in Table and Figure . Specifically, in the case of LC50-DM, the r2, mae, and rmse values are substantially improved by using SeV and FPeV with DESC (M3-10 meta model). This paves a path toward the idea of using weak features (SeV and FPeV) along with strong features (DESC) to further enhance the performance due to the strong features. The chemical information captured by weak features might not be significant by itself but can play a vital role when aggregated with the chemical information extracted using strong features. As shown in Supporting Information, S4, besides performance improvements, meta ensembling helps stabilize the results by producing low standard deviation in most metrics. In the case of smaller data sets such as LC50-DM, meta models, M1, decrease the mean standard deviation value substantially as compared to the mean stand deviation value for base models as shown in Figure of Supporting Information, S4.

Conclusions

Quantitative toxicity measurement has paramount importance in pharmaceuticals. Toxicity prediction for chemical compounds recently achieved enhanced performance in terms of accuracy after the introduction of various deep learning models in this space. Usually, molecules are represented by a given type of features and a specific machine learning method is then used to predict the toxicity. The performance of any quantitative toxicity prediction method depends upon the specific type of features and the learning model used. This restricts the overall performance to a single type of feature and a learning model. In this study, we have introduced a deep-learning-based framework called QuantitativeTox for predicting quantitative toxicity end points or data sets such as LD50, IGC50, LC50, and LC50-DM. Our approach has three stages: generating base features, training base learning models on the base features, and training a meta-learning model. We use five types of base features and then use five base learning models on them. The outputs of the base learning models are used as the meta features for the meta-learning model. To support multitask training, each model produces four outputs for four data sets and the loss function uses a weighted sum over the data sets. We have found that high-level physicochemical, low-level fingerprints, SMILES-embedded vectors, and fingerprint-embedded vectors when used to create meta features for the meta ensemble model enhance the performance over a wide range of metrics for the quantitative toxicity prediction tasks. We evaluated our framework against three main regression metrics using independent test sets and obtained a robust performance compared to state-of-the-art methods. Our framework can serve as a robust tool for quantitative toxicity prediction with a better aggregation strategy for various features along with individual models and multitasking. It can also be applied to other similar tasks which are related to each other like in the current study such as LD50, IGC50, LC50, and LC50-DM..

16 in total

Review 1. Use of machine learning approaches for novel drug discovery.

Authors: Angélica Nakagawa Lima; Eric Allison Philot; Gustavo Henrique Goulart Trossini; Luis Paulo Barbour Scott; Vinícius Gonçalves Maltarollo; Kathia Maria Honorio
Journal: Expert Opin Drug Discov Date: 2016 Impact factor: 6.098

2. Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks.

Authors: Kedi Wu; Guo-Wei Wei
Journal: J Chem Inf Model Date: 2018-01-31 Impact factor: 4.956

3. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints.

Authors: Chun Wei Yap
Journal: J Comput Chem Date: 2010-12-17 Impact factor: 3.376

4. FP2VEC: a new molecular featurizer for learning molecular properties.

Authors: Woosung Jeon; Dongsup Kim
Journal: Bioinformatics Date: 2019-12-01 Impact factor: 6.937

5. In silico prediction of ROCK II inhibitors by different classification approaches.

Authors: Chuipu Cai; Qihui Wu; Yunxia Luo; Huili Ma; Jiangang Shen; Yongbin Zhang; Lei Yang; Yunbo Chen; Zehuai Wen; Qi Wang
Journal: Mol Divers Date: 2017-08-02 Impact factor: 2.943

Review 6. Machine learning in chemoinformatics and drug discovery.

Authors: Yu-Chen Lo; Stefano E Rensi; Wen Torng; Russ B Altman
Journal: Drug Discov Today Date: 2018-05-08 Impact factor: 7.851

7. Computational determination of hERG-related cardiotoxicity of drug candidates.

Authors: Hyang-Mi Lee; Myeong-Sang Yu; Sayada Reemsha Kazmi; Seong Yun Oh; Ki-Hyeong Rhee; Myung-Ae Bae; Byung Ho Lee; Dae-Seop Shin; Kwang-Seok Oh; Hyithaek Ceong; Donghyun Lee; Dokyun Na
Journal: BMC Bioinformatics Date: 2019-05-29 Impact factor: 3.169

8. Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem.

Authors: Lianyi Han; Yanli Wang; Stephen H Bryant
Journal: BMC Bioinformatics Date: 2008-09-25 Impact factor: 3.169

9. Predicting drug side effects by multi-label learning and ensemble learning.

Authors: Wen Zhang; Feng Liu; Longqiang Luo; Jingxia Zhang
Journal: BMC Bioinformatics Date: 2015-11-04 Impact factor: 3.169

10. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions.

Authors: Jie Dong; Zhi-Jiang Yao; Lin Zhang; Feijun Luo; Qinlu Lin; Ai-Ping Lu; Alex F Chen; Dong-Sheng Cao
Journal: J Cheminform Date: 2018-03-20 Impact factor: 5.514