
ToxCast EPA in Vitro to in Vivo Challenge: Insight into the Rank-I Model.

Sergii Novotarskyi¹, Ahmed Abdelaziz^2,3, Yurii Sushko¹, Robert Körner¹, Joachim Vogt¹, Igor V Tetko^4,5.

Abstract

The ToxCast EPA challenge was managed by TopCoder in Spring 2014. The goal of the challenge was to develop a model to predict the lowest effect level (LEL) concentration based on in vitro measurements and calculated in silico descriptors. This article summarizes the computational steps used to develop the Rank-I model, which achieved the lowest prediction error for the secret test data set of the challenge. The model was developed using the publicly available Online CHEmical database and Modeling environment (OCHEM), and it is freely available at http://ochem.eu/article/68104. Surprisingly, this model does not use any in vitro measurements. The logic of the decision steps used to develop the model and the reasons for not including in vitro measurements are described. We also show that inclusion of in vitro assays would not improve the accuracy of the model.


Year: 2016 PMID: 27120770 PMCID: PMC5413193 DOI: 10.1021/acs.chemrestox.5b00481

Source DB: PubMed Journal: Chem Res Toxicol ISSN: 0893-228X Impact factor: 3.739

Introduction

The prediction of in vivo toxicity based on in vitro measurements is a challenging task, which is at the center of active development in modern computational toxicology.[1−4] The TopCoder data science competition platform, in collaboration with the Environmental Protection Agency (EPA), organized the ToxCast challenge in 2014.[5] The target property was the lowest effect level, or LEL. LEL is defined as “the lowest dose that shows adverse effects in these animal toxicity tests. The LEL is then conservatively adjusted in different ways by regulators to derive a value that can be used by the Agency to set exposure limits that are expected to be tolerated by majority of the population.”[5] The total ToxCast challenge included five consecutive subchallenges, which were executed over a seven-month period and attracted 432 registrants from 32 countries. The first subchallenge was to identify software libraries and/or methods to describe the chemical structure of various compounds. The second subchallenge was to identify a specific combination of in vitro assays that could be used to predict the in vivo systemic LEL. The third subchallenge was executed privately and was entitled “Predictive Capability Tests”. The challenge described in this study was the fourth subchallenge. It had a goal “to build a prediction model (algorithm) using data from high-throughput in vitro assays, chemical properties, and chemical structural descriptors to quantitatively predict a chemical’s systemic LEL.”[5] The fifth and final subchallenge concerned the documentation of the results of the models. In this study, we present the results of the Rank-I model for the prediction challenge. According to the challenge rules, the participants were strictly forbidden to use any data other than the data that were provided in this competition. This was done in order to offer equal conditions to all participants as well as to better evaluate the performance of models developed using in vitro measurements.
Indeed, the use of any information outside of the data provided within the challenge, e.g., information about toxicological chemical pathways, could potentially bias the comparison of algorithms. The summary of this prediction challenge was published by the EPA.[6] Since September 2015, this information has been available from the Web archives[7] and included as Supporting Information (see section “EPA ToxCast LELPredictor Marathon Match Results Summary”) to this study. This article analyzes the steps that were taken to develop the Rank-I submission model by participant “novserj” (notice that the participant abbreviation is incorrectly reported as “noveserj” in the result table[6]). The Supporting Information also contains an extended technical description, which was submitted to TopCoder as part of the contest. Part of this description was also used by EPA in their report.[6,7] Data, Rank-I model, and model predictions are available at http://ochem.eu/article/68104. We believe that this article will be interesting to both participants and organizers of the challenge and will help them to better understand and interpret the results of the challenge. It will also help other scientists in developing models with high prediction power. Moreover, the model reported in this study has the highest prediction ability for LEL end point as validated by the challenge organizers and thus can be of potential value for people working on the risk assessment of chemical compounds.

Data

Training and Test Data Sets

The in vitro measurements provided within the scope of the challenge included “a battery of more than 700 biochemical and cell-based in vitro assays to identify what proteins, pathways, and cellular processes these chemicals interact with and at what concentration they interact.”[5] The total data set used during the challenge incorporated 1,854 molecules. The experimental LEL values were provided for 483 compounds that were used as the training set. The test set included LEL values for 143 chemicals, which were kept secret and were split into provisional (63) and final scoring (80) sets. During the challenge, TopCoder participants could submit predictions and receive the statistical results for the provisional set. Such results could be used to optimize the models during the submission stage. The 80 compounds from the final scoring set were used only once to rate the final model submissions. The challenge organizers did not reveal information about the compounds, which were used as the test sets. After the challenge, the experimental values for 143 compounds from the test set were kindly released by the TopCoder organizers to the authors of the article. Both training and test set compounds are publicly available for download from http://ochem.eu/article/68104.

Methods

The detailed description of the Rank-I submission is provided in Supporting Information (see “Technical description” section). Below, we briefly recapitulate the main steps.

Descriptor Packages

Ten different descriptor packages implemented in the public platform OCHEM[8] were used individually to create ten models for the resulting consensus. Four descriptor packages, E-state,[9] QNPR,[10] ISIDA fragmentor,[11] and GSFrag,[12] were based on the 2D representations of the chemical structures. The other six packages, Inductive,[13] ChemAxon,[14] Adriana,[15] Mera/Mersy,[16] CDK,[17] and Dragon[18] descriptors, were calculated using 3D representations of the chemical structures. The 3D structure representation was generated using Corina.[19]

Unsupervised Descriptor Selection

Within each individual model, a basic unsupervised descriptor selection procedure was performed. First, descriptors with constant values for the data set were removed. Next, duplicated descriptors with a pairwise correlation of more than 0.95 were eliminated. Exactly the same procedure was used for all models, and thus, the same number of descriptors and molecules was utilized to develop each model with different machine learning methods. The number of descriptors selected for each model after the unsupervised filtering is shown in Table 1.
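The two filtering steps can be sketched as follows. This is a minimal illustration, not the OCHEM implementation: the function names, the use of the absolute Pearson correlation, and the greedy rule that keeps the first descriptor of each correlated pair are all assumptions.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def unsupervised_filter(columns, threshold=0.95):
    """Return the names of descriptors kept after both filtering steps.

    columns: dict mapping descriptor name -> list of values (one per molecule).
    """
    # Step 1: drop descriptors that are constant over the data set.
    varying = {k: v for k, v in columns.items() if max(v) != min(v)}
    # Step 2 (assumed greedy rule): keep a descriptor only if its absolute
    # correlation with every descriptor kept so far does not exceed the threshold.
    kept = []
    for name, values in varying.items():
        if all(abs(pearson(values, varying[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

For example, with descriptors `a`, `b` (nearly proportional to `a`), `c` (constant), and `d` (weakly correlated with `a`), only `a` and `d` would be kept.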

Table 1

Number of Descriptors and Models’ Accuracy for the Prediction of the Test Set Compounds

		RMSE
descriptor set	number of selected descriptors	whole test set (n = 143)	inside of AD^a (n = 136)	outside of AD (n = 7)
CDK	159	1.13	1.01	2.4
Dragon	1824	1.15	1.05	2.4
Fragmentor	631	1.18	1.04	2.7
GSFrag	202	1.1	0.97	2.5
Mera, Mersy	242	1.04	0.96	2.1
Chemaxon	97	1.16	1.06	2.4
Inductive	39	1.17	1.03	2.7
Adriana	133	1.14	1.01	2.5
QNPR	381	1.12	1.02	2.7
E-state	185	1.16	1	2.8
in vitro	143	1.21	1.11	2.5
Consensus	4036	1.08	0.96	2.5

^a AD is the applicability domain of the model as defined by OCHEM[8] (see also ref 20).

Machine Learning Methods

The model used in the challenge was developed with Associative Neural Networks (ASNN). The ASNN exploits the idea of ensemble learning. In a simplified way, it can be considered as an ensemble of neural networks combined with the k-nearest neighbor (kNN) method applied in the space of ensemble predictions. The models developed with the ASNN were top-ranked in several benchmarking studies,[3,20−29] which is why this method was selected for the EPA challenge. The default parameters for the ASNN algorithm, as optimized during previous studies and provided on the OCHEM Web site, were used. They included 64 neural networks in the ensemble, each with 3 neurons in a hidden layer, trained by the SuperSAB[30] algorithm. In addition to ASNN, we also analyzed kNN, support vector machines (as implemented in LibSVM),[31,32] and partial least squares[33] methods. As with the ASNN method, the default parameters of these algorithms as provided on the OCHEM site were used.

Validation Protocols

An unbiased estimate of model performance is critically important for model selection and for decisions made during model development. Two protocols, cross-validation and bagging, are frequently used to estimate validation accuracy for the training set. The cross-validation protocol splits the initial data set into n chunks. It uses n – 1 chunks as the training set and predicts the remaining chunk of the data. Bootstrap aggregation (bagging), introduced by Leo Breiman,[34] is another powerful approach to develop and validate models. It is based on the aggregation of models, each of which is developed with its own training set (“bag” in the terminology of Breiman). Each bag is formed by random sampling with replacement from the initial training set and has the same size as the initial set. The molecules (on average 37%) that are not drawn into the respective bag are called “out-of-the-bag”. The predictions for these molecules are used to evaluate the predictive power of models. An ensemble of 64 bagged models was used.
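The bagging protocol described above can be illustrated with a short sketch. The names `bootstrap_bags`, `oob_predictions`, `fit`, and `predict` are hypothetical placeholders for illustration only; the learner is passed in by the caller and is not the ASNN code used in the study.

```python
import random


def bootstrap_bags(n_molecules, n_bags=64, seed=0):
    """Yield (in_bag, out_of_bag) index lists, one pair per bag."""
    rng = random.Random(seed)
    for _ in range(n_bags):
        # Sample with replacement; each bag has the same size as the training set.
        in_bag = [rng.randrange(n_molecules) for _ in range(n_molecules)]
        chosen = set(in_bag)
        out_of_bag = [i for i in range(n_molecules) if i not in chosen]
        yield in_bag, out_of_bag


def oob_predictions(X, y, fit, predict, n_bags=64, seed=0):
    """For each molecule, average the predictions of the bags that did not contain it."""
    sums = [0.0] * len(y)
    counts = [0] * len(y)
    for in_bag, oob in bootstrap_bags(len(y), n_bags, seed):
        model = fit([X[i] for i in in_bag], [y[i] for i in in_bag])
        for i in oob:
            sums[i] += predict(model, X[i])
            counts[i] += 1
    # A molecule that happened to fall into every bag has no out-of-the-bag value.
    return [s / c if c else None for s, c in zip(sums, counts)]
```

The expected out-of-the-bag share follows from the sampling itself: a molecule is missed by all n draws with probability (1 − 1/n)^n, which approaches 1/e, or about 37%, for large n.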

Supervised Descriptor Selection Using Neural Network Pruning

In the 1990s, several methods were developed to identify the most significant descriptors for neural networks.[35−39] Some of these methods calculate the sensitivity (importance) of input parameters from derivatives of the error function with respect to the neural network weights,[35] while others provide such estimations based on the analysis of the magnitudes of the neural network weights.[37] For this study, we used a method from the second group, which provided the best results in our previous studies.[37−39] The sensitivity S_i of a neuron i was calculated from w_ji, S_j, and max_a, where w_ji were the weights connecting neurons i and j, max_a was taken over all weights ending at the neuron j having sensitivity S_j, and summation was taken over all weights connecting the neuron i with the upper-layer neurons. The sensitivity calculations were performed recursively starting from the last-layer neuron, which had a sensitivity of 1. The pruning procedure started once neural network training was completed. At each step, the least significant descriptors, i.e., those with the smallest S, were eliminated, and the models were retrained with the reduced set of descriptors. The sets of descriptors that gave the minimal errors were considered the optimal ones. In order to avoid overfitting and overtraining,[40] the neural networks were trained using the efficient partitioning algorithm,[41] which uses the early stopping procedure.[40]
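One possible explicit form of this recursion is given here as an assumption, not as a quotation of the study. The squared, max-normalized weight follows the magnitude-based sensitivities of the pruning studies cited above;[37−39] the exponent and the absolute value are not stated in this section.

```latex
S_i = \sum_{j} \left( \frac{w_{ji}}{\max_{a} \lvert w_{ja} \rvert} \right)^{2} S_j ,
\qquad S_{\mathrm{out}} = 1
```

Here j runs over the upper-layer neurons connected to neuron i, a runs over all weights ending at neuron j, and S_out is the sensitivity of the last-layer neuron from which the recursion starts.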

Statistical Parameters

The root mean square error (RMSE) metric was used to score models; the RMSE is lower for models with higher performance. The challenge organizers used a scoring function based on the RMSE to rank the models. As a result, models with lower RMSE received a higher score and a higher rank. In addition to RMSE, the organizers also reported Pearson correlation coefficients and AUC, defined as “percentage of pairs where predicted1 < predicted2 among those where ground_truth1 < ground_truth2 (the higher the value, the better the result)”.[6]
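The RMSE and the pairwise AUC quoted above can be computed as in the following sketch. The function names are hypothetical. Returning a fraction rather than a percentage, and skipping pairs with equal ground truth, are assumptions based on the quoted definition; the organizers' scoring function itself is not reproduced.

```python
from itertools import combinations
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between experimental and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pairwise_auc(y_true, y_pred):
    """Share of pairs with predicted1 < predicted2 among pairs with truth1 < truth2."""
    concordant = total = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue  # the quoted definition only counts strictly ordered pairs
        if t1 > t2:
            p1, p2 = p2, p1  # orient the pair so that truth1 < truth2
        total += 1
        concordant += p1 < p2
    return concordant / total
```

For `y_true = [1, 2, 3]` and `y_pred = [1, 3, 2]`, two of the three ordered pairs are predicted in the right order, so `pairwise_auc` returns 2/3.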

Results

The workflow for the model development used in the challenge (see “Technical description” in the Supporting Information) was based on our previous experience in developing recently published models.[4,22,23,42] Final statistical results for the top-ranking models are summarized in Table 2. Below, we provide a detailed analysis of the steps that were used to develop the model.

Table 2

Summary of the Performance of the Top-Ranked Models of the EPA ToxCast Challenge

			test set
	training set (n = 483)^a		provisional subset (n = 63)		final subset (n = 80)			full (n = 143)
model	RMSE	R²	RMSE	rank	RMSE	R²	rank	RMSE
novserj	0.88 ± 0.04	0.27 ± 0.04	1.03 ± 0.08^b	8	1.12 ± 0.08^b	0.31	1	1.08 ± 0.07
NobuMiu			1.03	9	1.13	0.30	2	1.09
a9108tc			1.05	16	1.13	0.29	3	1.10
klo86min			1.09	27	1.14	0.29	4	1.12
in vitro assays^c	0.97 ± 0.04	0.11 ± 0.03						1.24 ± 0.09
MW + NC^d	0.97 ± 0.04	0.11 ± 0.03						1.18 ± 0.08

^a Prediction accuracy for the “out-of-the-bag” samples.

^b Confidence intervals were estimated using subsets that were sampled from the training set, each of which had the same size as the respective test set (for more details, see ref 23).

^c Best model based on the in vitro assay descriptors developed using the LibSVM method (see also Table S1).

^d Model based on molecular weight (MW) and number of carbon atoms (NC) developed using the same approach as the above in vitro model.


Failed Molecules

There were 37 molecules, including 11 molecules from the training set, for which descriptor generation failed for different packages. The CDK descriptor package does not support inorganic elements such as [Sn], [Hg], [B], and [As]. The failed molecules included either atoms unsupported by CDK or some groups, e.g., [N3+]. Several other molecules, e.g., rifampicin, α-cyclodextrin, milbemectin, and emamectin benzoate, were large chemical structures and failed either due to time-out or structure conversion problems. According to the challenge rules, the participants were required to submit predictions for all molecules. Therefore, we had to submit some values. As a simple solution, we used the average value over all training set molecules, logLEL = −3.2602 log(M), as the predicted value for the failed molecules.

Scoring of Models: How Useful Is the Provisional Test Statistics?

The challenge organizers offered a provisional set of N = 63 compounds for the purpose of model analysis and selection. However, we decided to skip the testing on this set for the following reasons. The provisional test set was much smaller than the training set. Thus, an attempt to rely on the models’ performance for this set by, e.g., submitting multiple predictions and selecting a “best” model using it, could introduce a higher uncertainty and result in the selection of a nonoptimal model for the final set. Indeed, the final model RMSE was 0.88 ± 0.04 for the N = 472 training set molecules. The provisional set was not available to us, and thus, we were unable to calculate the confidence intervals for it directly. Instead, we estimated them by randomly sampling subsets of N = 63 molecules from the training set and calculating the intervals for each subset. The confidence intervals for these subsets were about 2-fold larger (±0.08). Thus, selecting the best model based on the performance for the provisional test set is about twice as uncertain as selecting it based on the training set. Therefore, relying on the validated results estimated for the training set is a more reliable strategy than relying on the accuracy of models for the provisional test set. An even better strategy could be to select a model based on the combined accuracies for the provisional and training sets, but such an analysis was not implemented. The confidence intervals for N = 80 molecules were about the same as those for N = 63 molecules. The wide confidence intervals for both provisional and final test sets might have contributed to the fluctuations of the ranks of challenge models between the two sets. For example, the top final scoring model was only ranked #8 for the provisional submission, while the fourth model was ranked #27.
Conversely, the models ranked #1 and #4 for provisional submissions were ranked #9 and #34 (out of 47 participants) for the final test set.[5] Thus, the provisional rank was not strongly predictive of the final one: the provisional and final ranks of the models had a correlation coefficient of only R = 0.76. The RMSEs of the eight top-ranked final models were in the range 1.12 to 1.16 and thus were within the confidence interval of the winning model. Thus, statistically speaking, these models had the same performance, and the differences between them were due to chance.
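The effect of subset size on the uncertainty of RMSE can be illustrated with a small resampling sketch. It assumes that per-molecule validation errors (predicted minus experimental) are available for the training set and uses the standard deviation of the subset RMSEs as a measure of spread; the exact interval construction used in the study (see ref 23) may differ, and the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def subsample_rmse_spread(errors, subset_size, n_draws=1000, seed=0):
    """Mean and standard deviation of RMSE over random subsets of the errors."""
    rng = random.Random(seed)
    rmses = []
    for _ in range(n_draws):
        subset = rng.sample(errors, subset_size)  # drawn without replacement
        rmses.append(sqrt(sum(e * e for e in subset) / subset_size))
    return mean(rmses), stdev(rmses)
```

Calling the function with `subset_size=63` and then with `subset_size=80` on the same list of errors gives a direct comparison of how much the RMSE of a provisional-sized or final-sized set would fluctuate by chance alone.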

Analysis of the Machine Learning Methods

The model submitted to the TopCoder challenge was a consensus of the bagging models developed with the ASNN method. In this section, we briefly describe the considerations that were used to develop and select this model for the challenge. The OCHEM Web site provides several machine learning methods and descriptors. Below, we compare the performance of the different methods described in the Methods section. First, bagging was compared with cross-validation. Table S1 in the Supporting Information demonstrates that models developed using bagging had consistently smaller validation RMSEs as compared to the cross-validation results. This result is in agreement with our previous observations.[4,21,22,24] Therefore, the bagging approach was used. Comparison of different machine learning methods (Table S1) shows that combinations of machine learning methods and descriptors provided quite similar performances with RMSE ranging from 0.9 to 1.2 log units. Considering that 95% confidence intervals of RMSE were ±0.4 log units, the majority of these models were not significantly different. The LibSVM approach resulted in the lowest RMSE for individual models. The highest RMSE = 1.2 (i.e., the lowest performance) was calculated using the PLS method for Mera/Mersy descriptors. The failure of this method was due to several outlying molecules that had predictions far beyond the range of the training set values. These outliers may be due to the sparseness of the descriptors used to develop the models and the insufficient number of data points used to fit the coefficients in PLS. If we limited the predicted values for all compounds to the range of the training set values, the results of PLS models became similar to those of other methods. It is interesting that the model developed using in vitro assay measurements consistently provided the lowest accuracy compared to the other descriptors.

Development of the Rank-I Submission Model

Considering that models developed with different descriptor sets had approximately similar performance, we decided to build our consensus models using a simple average of individual models. For each machine learning approach, a consensus model was built for all descriptor packages. Since the individual models were calculated using the bagging approach, the developed consensus models were also validated using the same protocol. The consensus model based on the ASNN method gave the lowest RMSE compared to consensus models developed using other machine learning approaches. The exclusion of the model based on in vitro descriptors did not change the accuracy of the ASNN consensus model. The model based on a combination of both in silico and in vitro descriptors requires both sets of descriptors. This limits its application to compounds for which in vitro measurements are present, while the model based exclusively on in silico descriptors can be applied to any new compounds. Therefore, we decided to submit the model developed using only in silico descriptors to the challenge. The model development steps were based on simple decisions, which followed the “Occam’s razor” principle. We found that the models developed on the training set had large validation RMSEs and that the provisional set statistics had a limited value for model selection. Therefore, we followed the model development steps that were successful in our previous studies.[20,21,23−27,42] This strategy allowed us to develop the Rank-I model.

Comparison with a Simple Two-Descriptors Based Model

Did the complexity of the final model (a consensus of several individual models, each of which uses a different descriptor set and is a bootstrap aggregation of multiple neural network submodels) add any value, or could we get similar results using a simpler approach? In order to answer this question, we developed models using just two descriptors, molecular weight and number of carbon atoms, using linear regression. The RMSE of this model on the training set was 1.0 ± 0.04 log units. The use of the same descriptors with the bagging approach decreased the RMSE to 0.97 ± 0.04 log units for the LibSVM method (Table 2). This error was significantly higher than that of the Rank-I model. Interestingly, the best model based on the in vitro assay measurements had exactly the same accuracy (RMSE = 0.97 ± 0.04).

Analysis of the Test Set Compounds

The TopCoder organizers kindly released information about the experimental values for the N = 143 test set molecules. It allowed us to provide an additional analysis of the results for this set and to better evaluate the influence of the in vitro descriptors on the prediction accuracy.

Analysis of Several Models Involving in Vitro Descriptors

The model developed using in vitro descriptors (see Table 2) had a higher RMSE = 1.24 for the test set as compared to that of the consensus model, RMSE = 1.08, based on in silico descriptors. The extension of the consensus model by inclusion of the model based on in vitro descriptors increased the RMSE of the new consensus model to 1.10 log units. We also explored whether extension of in silico descriptors with in vitro descriptors can provide better prediction accuracy. For this study, we developed models using combinations of each descriptor set with in vitro descriptors. The RMSEs of the models developed with in silico + in vitro sets changed by −0.02 to 0.01 log units compared to the RMSEs of models calculated using only in silico descriptors. The RMSE of the consensus model based on in silico + in vitro sets was 1.09, i.e., 0.01 log units higher compared to that of the model based only on in silico descriptors.

Chemical Diversity

The RMSE calculated for the test set was significantly higher than that for the training set compounds (Table 2). An analysis of extended functional groups (EFG)[43] was done to identify whether both sets contained chemically different compounds. The EFG consists of 583 manually curated functional groups, which provide comprehensive coverage of heterocyclic compounds and are relevant for medicinal chemistry. The SetCompare tool[23] was used to determine statistically significantly overrepresented chemical groups in the training and test sets using the hypergeometric distribution. It was found that hydroxy compounds, amines, saturated six-membered heterocycles containing one heteroatom, etc., were overrepresented in the test set, while pnictogens, thiophosphoric acid esters, halogen derivatives, etc. were overrepresented in the training set. The full list of overrepresented groups is available at http://ochem.eu/article/68104. Thus, the differences in chemical diversity between the two sets may have contributed to the observed differences in the RMSEs for the training and test sets.
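An overrepresentation test based on the hypergeometric distribution can be sketched as below. This is not the SetCompare code: the function name and the one-sided upper-tail formulation are assumptions made for illustration.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    population: total number of molecules (training + test)
    successes:  molecules in the population that contain the functional group
    draws:      size of the set being examined (e.g., the test set)
    k:          molecules in that set that contain the functional group
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(k, upper + 1)
    )
    return tail / total
```

A small p-value from `hypergeom_sf(k, population, successes, draws)` indicates that the group occurs in the examined set more often than expected from a random split of the pooled molecules.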

Analysis of Compounds Predicted with Large Errors

The EFG and SetCompare tools were also used to analyze which chemical features contributed to predictions with high errors. A difference of 1.5 log units between predicted and experimental values was used to identify N = 62 compounds with high prediction errors. Most of these compounds had extreme LEL values, i.e., either low or high values, and only 7 compounds (10%) were within the [−4, −2] log(M) interval as compared to 420 compounds (75%) of the remaining group. Thus, the model had difficulties in predicting highly toxic and nontoxic compounds.

Applicability Domain

According to the TopCoder rules, the submitted results were scored using predictions for all test set compounds. However, some of the compounds from the test set could be outside of the applicability domain (AD) and have lower prediction accuracy. OCHEM uses the standard deviation of the predictions of the models that contribute to the consensus model as the distance to the model (abbreviated as STD-CONS).[20] STD-CONS was found to be the best definition of the distance to the model.[20,27] OCHEM defines the AD of a model as the value of STD-CONS that covers 95% of the predictions for the training set. In the previous benchmarking study of 12 definitions of distances to the model applied to 11 models, we found that STD-CONS provided the best separation of molecules with large and small prediction errors regardless of the model used.[20] Therefore, we concluded that the AD of models is determined by the composition of the training set of molecules rather than by the descriptors or machine learning methods used. In the current study, seven compounds from the test set were outside of the AD of the consensus model. The consensus model as well as the individual submodels calculated significantly higher RMSEs, which were in the range of 2.1 to 2.7 log units, for these seven compounds (Table 1). This result supports the previous conclusions about the universal nature of the AD of models and the good discriminating power of the STD-CONS distance to the model. It also indicates that taking into consideration the AD of the models is important to avoid predictions with high errors. It is interesting that four out of seven compounds had LEL < 3, while the other three compounds had LEL > 5.5. Thus, the AD used has identified compounds that had experimental toxicity values in the ranges that are difficult to predict.
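The consensus prediction and the STD-CONS applicability domain described above can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation and of a simple order-statistic cutoff at 95% coverage are assumptions, not the OCHEM implementation.

```python
from math import ceil
from statistics import mean, stdev


def consensus_and_std(per_model_predictions):
    """Per-molecule consensus (mean) and STD-CONS (spread across submodels).

    per_model_predictions: list of equal-length prediction lists, one per submodel.
    """
    per_molecule = list(zip(*per_model_predictions))
    return [mean(p) for p in per_molecule], [stdev(p) for p in per_molecule]


def ad_threshold(train_std, coverage=0.95):
    """STD-CONS value that covers the given share of training-set predictions."""
    ordered = sorted(train_std)
    index = max(0, ceil(coverage * len(ordered)) - 1)
    return ordered[index]


def inside_ad(std_values, threshold):
    """True for molecules whose STD-CONS does not exceed the AD threshold."""
    return [s <= threshold for s in std_values]
```

In use, the threshold is derived once from the training-set STD-CONS values and then applied to the STD-CONS values of new compounds to flag those outside the AD.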

Development of Models Using Descriptors Optimized with Pruning

The final consensus model was based on 10 submodels, which were developed with N = 4036 descriptors (Table 1). These descriptors were selected from the initial set following the unsupervised filtering procedure. We explored whether the performance of this model can be further improved by using a supervised descriptor selection procedure based on neural network pruning of the least sensitive descriptors. The application of this procedure decreased the numbers of descriptors by 5- to 100-fold (Table 3). Models developed using these descriptors had on average lower training set RMSEs as compared to those based on descriptors selected by the unsupervised filtering, while the opposite result was calculated for the test set RMSEs (Table 3). Thus, selection of descriptors optimal for the training set introduced variable selection bias.[44] Indeed, during supervised selection of descriptors we evaluated the performance of models for the training set molecules multiple times. This resulted in the selection of descriptors with an improved fit for this set but at the same time decreased the prediction accuracy for test set compounds, which have a different chemical diversity. Neural networks are very efficient methods for working with high-dimensional data[45] and can also be used efficiently without supervised variable selection.

Table 3

Performances of Models Developed Using Different Descriptor Selection Procedures^a

	unsupervised selection			neural network pruning
		RMSE			RMSE
descriptor set	N	training	test	N	training	test
CDK	159	0.93	1.13	6	0.89	1.2
Dragon	1824	0.93	1.15	18	0.87	1.19
Fragmentor	631	0.98	1.18	12	0.92	1.21
GSFrag	202	0.97	1.1	24	0.97	1.18
Mera, Mersy	242	0.93	1.04	10	0.93	1.18
Chemaxon	97	0.93	1.16	11	0.92	1.16
Inductive	39	0.94	1.17	21	0.93	1.16
Adriana	133	0.93	1.14	8	0.92	1.1
QNPR	381	0.95	1.12	74	0.89	1.13
E-state	185	0.96	1.16	11	0.9	1.24
Consensus	4036	0.88	1.08	186	0.85	1.13

^a N is the number of descriptors selected to develop the respective model. RMSE is the root mean squared error calculated for the training (n = 483) and full test set (n = 143).

Discussion

In this study, we highlighted the steps used to develop the Rank-I submission for the EPA ToxCast challenge, which was organized by the TopCoder community. We have shown how to consider limitations of the training and test data sets and that following “Occam’s razor” principle helps to provide a top-entry to the challenge. This conclusion is supported by other studies. A similar consensus approach was used to achieve the overall best balanced accuracy for 12 end points for another ToxCast challenge[3] organized by NIH.[46] The consensus modeling was also successfully used in the CERAPP project to identify potential endocrine disruptors.[47] It is rather surprising that the Rank-I model did not involve the in vitro descriptors. This can be attributed to several factors.

Poor Definition of the Predicted End Point

The LEL is defined as lowest effect level dose across multiple animal studies. This can contribute to considerable differences in the determined quantitative toxicity thresholds due to interspecies variations as well as differences in the experimental protocols. These factors could contribute to the biological noise of the measured values and make their prediction a difficult task.

Lack of Domain-Specific Modeling Approaches

The relatively weak performance of this model and all others in the challenge can also point to the limitations of the brute-force machine learning approach to this problem. The in vitro assay data may need to be treated as more than just a table of numbers, and one will need to incorporate biological knowledge into the structure of the model. Indeed, pharmacokinetic and pharmacodynamic properties of the analyzed molecules could be essential for their toxicity. Thus, we can expect that the use of systems biology methods can contribute to more accurate predictions of the LEL. It should be mentioned that the use of external data was explicitly forbidden for the purpose of the ToxCast challenge.

Insufficiency of Used in Vitro Assays

We cannot exclude the possibility that some of the currently used in vitro assays are insufficient for the analyzed end point. For example, if toxicity is caused by metabolites of the analyzed compound, in vitro assays that ignore metabolic activation may not correctly report toxicity. It is currently not clear how frequently such problems occur, but recent studies suggest that taking metabolic activation into consideration was an important factor for the prioritization of potentially emerging contaminants.[21] Of course, the same problem can also contribute to difficulties with the prediction of toxicity based on in silico descriptors. Which of these factors contributed to the low accuracy of the model? Such an analysis is beyond the scope of this article, and the question will hopefully be answered in the future by new computational studies from the scientific community. Importantly, the public availability of the test set compounds released with this article will help other users to develop new approaches to predict the LEL and to benchmark their results against the Rank-I model of the ToxCast challenge. Moreover, since the model is publicly available and does not use in vitro descriptors, it can be used to predict the LEL of new compounds in prospective studies and can be benchmarked using new measurements, which may become available in the future. We believe that publishing models online in a usable and reproducible manner will become an integral part of future computational chemistry.[48]

In summary, we have described the protocol for developing the Rank-I model of the EPA ToxCast challenge. The model is based only on in silico descriptors, and we were not able to increase its predictive ability using in vitro measurements in the postmarathon study presented in this article. The relatively low accuracy of this model indicates the high complexity of the LEL and suggests that pure brute-force machine-learning approaches may not be sufficient to accurately predict such a complex biological end point. Possibly, systems biology approaches can help to develop better models for the prediction of the LEL using the available in vitro measurements. At the same time, we cannot exclude the possibility that the currently used in vitro assays may not be sufficient to correctly characterize this end point. The developed model and the data used are publicly available at http://ochem.eu/article/68104 and can be used by interested users to answer these questions as well as to benchmark new ideas, methods, or approaches.
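
As a rough illustration of the benchmarking mentioned above, the following Python sketch compares two sets of LEL predictions against measured values using the root-mean-square error (RMSE). The choice of RMSE, the function names, and all numeric values are assumptions made for illustration only; they are not taken from the challenge data or its scoring rules.

```python
import math
from typing import Sequence, Tuple


def rmse(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between measured and predicted values."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    if len(measured) == 0:
        raise ValueError("at least one value is required")
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def compare_models(
    measured: Sequence[float],
    baseline_pred: Sequence[float],
    candidate_pred: Sequence[float],
) -> Tuple[float, float, bool]:
    """Return (baseline RMSE, candidate RMSE, True if the candidate is lower)."""
    base = rmse(measured, baseline_pred)
    cand = rmse(measured, candidate_pred)
    return base, cand, cand < base


# Hypothetical values (for example, log-transformed LEL); not real challenge data.
measured = [1.0, 2.0, 3.0, 4.0]
baseline = [1.5, 2.5, 2.5, 3.5]   # stand-in for a published model's predictions
candidate = [1.0, 2.0, 3.0, 4.4]  # stand-in for a new model's predictions

base_rmse, cand_rmse, improved = compare_models(measured, baseline, candidate)
print(f"baseline RMSE = {base_rmse:.3f}, candidate RMSE = {cand_rmse:.3f}, improved = {improved}")
```

In practice, the baseline predictions would come from the publicly available model and the measured values from the released test set, with the error metric chosen to match the comparison being made.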

