Andrea Morger^1, Miriam Mathea^2, Janosch H Achenbach^2, Antje Wolf^2, Roland Buesen^2, Klaus-Juergen Schleifer^2, Robert Landsiedel^2, Andrea Volkamer^3.
Abstract
Risk assessment of newly synthesised chemicals is a prerequisite for regulatory approval. In this context, in silico methods have great potential to reduce time, cost, and ultimately animal testing as they make use of the ever-growing amount of available toxicity data. Here, KnowTox is presented, a novel pipeline that combines three different in silico toxicology approaches to allow for confident prediction of potentially toxic effects of query compounds, i.e. machine learning models for 88 endpoints, alerts for 919 toxic substructures, and computational support for read-across. It is mainly based on the ToxCast dataset, which after preprocessing contains a sparse matrix of 7912 compounds tested against 985 endpoints. When applying machine learning models, applicability and reliability of predictions for new chemicals are of utmost importance. Therefore, first, the conformal prediction technique was deployed, which comprises an additional calibration step and by definition creates internally valid predictors at a given significance level. Second, to further improve validity and information efficiency, two adaptations are suggested, exemplified on the androgen receptor antagonism endpoint. An absolute increase in validity of 23% on the in-house dataset of 534 compounds could be achieved by introducing KNNRegressor normalisation. This increase in validity comes at the cost of efficiency, which could again be improved by 20% for the initial ToxCast model by balancing the dataset during model training. Finally, the value of the developed pipeline for risk assessment is discussed using two in-house triazole molecules. Compared to a single toxicity prediction method, complementing the outputs of different approaches can have a higher impact on guiding toxicity testing and de-selecting most likely harmful development-candidate compounds early in the development process.
Keywords: Androgen receptor; Applicability domain; Case study; Confidence estimation; Conformal prediction; Random forest; Read-across; ToxCast; Toxicity prediction; Triazoles
Year: 2020 PMID: 33431007 PMCID: PMC7157991 DOI: 10.1186/s13321-020-00422-x
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Size and purpose of androgen receptor antagonism datasets used to validate the original conformal prediction model
| Dataset | Purpose | Actives | Inactives |
|---|---|---|---|
| ToxCast-AA 762 | Train and test model | 868 | 5842 |
| In-house-AA | Validation I | 280 | 254 |
| External-AA | Validation II | 160 | 201 |
Fig. 1 Overview of KnowTox. Combining toxicity information from different sources, the complementary outputs of the KnowTox tool help to generate a holistic toxicity prediction picture for a novel query compound. ToxCast Database bar plot: Number of actives (grey) and inactives (blue) available per endpoint, sorted by number of actives. Red vertical line: CP models were built for the endpoints on the left side of the threshold line (at least 300 active and inactive compounds tested, red horizontal line)
Conformal prediction models built for androgen receptor antagonism
| Model name | Descriptors | nc^a | Balancing |
|---|---|---|---|
| Original | Morgan + MACCS | Default | No |
| Normalised | Morgan + MACCS + physchem^b | Normalised | No |
| Normalised + balanced | Morgan + MACCS + physchem^b | Normalised | Yes |
^a nc: nonconformity score
^b physchem: physicochemical descriptors
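The idea behind a normalised nonconformity score can be sketched as follows. This is a minimal illustration, not the KnowTox implementation: the exact formula is not given in this record, so dividing the raw nc score by a k-nearest-neighbour estimate of the expected nc score plus a small constant `beta` is an assumption, and the function names, `k`, and `beta` are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Mean target of the k nearest training points (Euclidean distance)."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)


def normalised_nc(raw_nc, train_x, train_nc, query, k=3, beta=0.01):
    """Scale a raw nc score by the nc score expected in that descriptor region.

    Assumption: normalised nc = raw nc / (expected nc + beta), where the
    expected nc is predicted by a k-NN regressor fitted on the descriptors
    and nc scores of the proper training set.
    """
    expected_nc = knn_predict(train_x, train_nc, query, k)
    return raw_nc / (expected_nc + beta)


# Hypothetical 2D descriptors and nc scores of proper training compounds.
descriptors = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
nc_scores = [0.1, 0.2, 0.3, 0.8, 0.9]
print(normalised_nc(0.42, descriptors, nc_scores, query=(0, 0)))
```

Under this assumption, a compound in a region where the underlying model is usually wrong (high expected nc) receives a smaller normalised score for the same raw score than a compound in a well-modelled region.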
Fig. 2 Schematic description of CP workflow. Data is split into training and test set (blue box). The training set is further divided into calibration (red box) and proper training set (violet box). An ML model is fitted on the proper training set and used to predict compounds of the calibration and test set. Predictions are transformed into nonconformity scores (nc scores). Calibration is conducted by sorting the nc scores of the calibration set class-wise (Mondrian) into two lists. The nc score of a test compound is ranked within the list of the respective class, which yields its p-value. An additional normaliser model (green box) can optionally be fitted on the descriptors and nc scores of the compounds of the proper training set
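The calibration step in this workflow can be sketched in a few lines. The sketch assumes the nc score is one minus the predicted probability of the class in question, and uses the standard inductive conformal p-value, (number of calibration scores of that class greater than or equal to the test score + 1) / (number of calibration scores of that class + 1). Function names and the toy probabilities are hypothetical.

```python
from bisect import bisect_left


def mondrian_p_values(cal_probs, cal_labels, test_probs):
    """Class-wise (Mondrian) inductive conformal p-values for one compound.

    cal_probs:  per-class probabilities for each calibration compound
    cal_labels: true class index of each calibration compound
    test_probs: per-class probabilities for the test compound
    """
    n_classes = len(test_probs)
    # One sorted list of calibration nc scores per class
    # (assumed nc score: 1 - predicted probability of the true class).
    cal_scores = [[] for _ in range(n_classes)]
    for probs, label in zip(cal_probs, cal_labels):
        cal_scores[label].append(1.0 - probs[label])
    for scores in cal_scores:
        scores.sort()

    p_values = []
    for c in range(n_classes):
        nc_test = 1.0 - test_probs[c]
        scores = cal_scores[c]
        # Count calibration scores that are at least as nonconforming.
        n_greater_equal = len(scores) - bisect_left(scores, nc_test)
        p_values.append((n_greater_equal + 1) / (len(scores) + 1))
    return p_values


def prediction_set(p_values, significance):
    """Classes whose p-value exceeds the significance level."""
    return {c for c, p in enumerate(p_values) if p > significance}


# Toy calibration set: four inactives (class 0) and three actives (class 1).
cal_probs = [[0.9, 0.1], [0.7, 0.3], [0.6, 0.4], [0.2, 0.8],
             [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]]
cal_labels = [0, 0, 0, 0, 1, 1, 1]
p = mondrian_p_values(cal_probs, cal_labels, [0.25, 0.75])
print(p, prediction_set(p, significance=0.2))
```

A prediction set with both classes means the model cannot distinguish them at the chosen significance level; an empty set means the compound conforms to neither class.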
Fig. 3 Calibration plots of the original, normalised, and normalised + balanced ToxCast-AA models applied to internal validation, in-house-AA and external-AA data
Comparison of original conformal prediction model for androgen receptor antagonism at 0.2 SL with other studies from literature
| Model | Validity (all) | Validity (class 1^c) | Validity (class 0^c) | Efficiency (all) | Accuracy (class 1^c) | Accuracy (class 0^c) |
|---|---|---|---|---|---|---|
| KnowTox-AA | 0.81 | 0.82 | 0.81 | 0.87 | 0.80 | 0.78 |
| eMolTox [ | – | 0.76–0.81 | 0.81–0.82 | 0.94–0.99 | – | – |
| Norinder et al. [ | 0.80–0.81 | 0.81–0.83 | 0.79–0.82 | – | 0.79–0.82 | 0.78–0.79 |
^a Values of models fitted on two different AA datasets
^b Three models with different fingerprints trained on one AA dataset
^c Class 1 = actives, class 0 = inactives
Information on KnowTox-AA and other CP methods using the random forest ML algorithm to predict androgen receptor antagonism
| Method | Data source: actives/inactives | CP aggregation method^a | Descriptors |
|---|---|---|---|
| KnowTox-AA | ToxCast: 868/5842 | ACP | Morgan+MACCS ( |
| eMolTox [ | (1) 532/6207 (2) 406/6256 | ACP | |
| Norinder et al. [ | Jensen et al. [ 293/637 | CCP | (1) Dragon (2) Signatures (3) Physchem |
^a ACP: aggregated conformal predictor; CCP: cross-conformal predictor [48]
^b Two models ((1), (2)) fitted on two different AA datasets
^c Three models ((1), (2), (3)) with different fingerprints trained on one AA dataset
^d Data for a total of 174 CP models originated from ChEMBL, PubChem, Toxnet, eChemPortal databases and literature [35]
Fig. 4 ToxCast, in-house and external data are projected into the descriptor space of a 2-component PCA trained on ToxCast-AA data
Evaluation of normalised + balanced^a conformal prediction model for androgen receptor antagonism at 0.2 SL
| Dataset | Purpose | Validity (all) | Validity (cl. 1^b) | Validity (cl. 0^b) | Efficiency (all) | Efficiency (cl. 1^b) | Efficiency (cl. 0^b) | Accuracy (all) | Accuracy (cl. 1^b) | Accuracy (cl. 0^b) |
|---|---|---|---|---|---|---|---|---|---|---|
| ToxCast-AA | train model | 0.85 | 0.84 | 0.85 | 0.57 | 0.39 | 0.60 | 0.89 | 0.76 | 0.91 |
| In-house-AA | validation I | 0.90 | 0.90 | 0.89 | 0.20 | 0.18 | 0.23 | 0.75 | 0.80 | 0.71 |
| External-AA | validation II | 0.80 | 0.76 | 0.82 | 0.43 | 0.33 | 0.52 | 0.74 | 0.67 | 0.78 |
^a Normalised nc score and balancing of calibration and proper training set
^b cl.: class (class 1 = actives, class 0 = inactives)
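The three metrics in this table can be computed from prediction sets as sketched below. The definitions used here are assumptions, since this record does not spell them out: validity is the fraction of prediction sets that contain the true label, efficiency is the fraction of single-label prediction sets, and accuracy is the fraction of single-label predictions that are correct. The function name and toy data are hypothetical.

```python
def cp_metrics(prediction_sets, true_labels):
    """Validity, efficiency and accuracy of conformal prediction sets.

    Assumed definitions:
      validity   = share of sets that contain the true label
      efficiency = share of sets with exactly one label
      accuracy   = share of single-label sets whose label is correct
    """
    pairs = list(zip(prediction_sets, true_labels))
    n = len(pairs)
    n_valid = sum(1 for labels, truth in pairs if truth in labels)
    singles = [(labels, truth) for labels, truth in pairs if len(labels) == 1]
    n_correct = sum(1 for labels, truth in singles if truth in labels)
    return {
        "validity": n_valid / n,
        "efficiency": len(singles) / n,
        # Undefined when no single-label prediction was made.
        "accuracy": n_correct / len(singles) if singles else float("nan"),
    }


# Toy prediction sets at some significance level (1 = active, 0 = inactive).
sets = [{0}, {1}, {0, 1}, set(), {1}]
labels = [0, 0, 1, 1, 1]
print(cp_metrics(sets, labels))
```

Class-wise values such as the cl. 1 and cl. 0 columns can be obtained by passing only the compounds whose true label is the class of interest.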
Fig. 5 Evaluation of final 88 CP models. Top: validity vs. efficiency for inactives (blue) and actives (grey). Bottom: sorted, overall accuracy
Fig. 6 KnowTox tool applied in a case study. a Triazoles1&2 used as query compounds for the case study. b Output of CP. Grey: number of endpoints per family available for CP. Red and blue: number of endpoints where triazoles1&2 were predicted to be active (SCP) at SL 0.2. c Three selected toxic alerts found for triazoles1&2. (Note that the potentially critical “triazole” substructure is not considered in this work). d Triazole1 (left) and triazole2 (right) and their most similar molecules in ToxCast including CAS number and Tanimoto similarity. Red: maximum common substructure. e Experimental information from ToxCast for propiconazole (left) and bromuconazole (right). Grey: available assays in ToxCast. Blue: assays where compound was tested active
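The similarity ranking behind panel d can be illustrated with the Tanimoto coefficient on fingerprint bit sets, T(A, B) = |A ∩ B| / |A ∪ B|. This is a minimal sketch: the bit sets, compound names, and function names are hypothetical, and the fingerprint type used for read-across in KnowTox is not stated in this record.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def most_similar(query_bits, library, top_n=1):
    """Return the top_n (name, similarity) pairs, most similar first."""
    scored = [(tanimoto(query_bits, bits), name) for name, bits in library.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_n]]


# Hypothetical on-bit sets standing in for real fingerprints.
library = {
    "cmpd_A": {1, 2, 3, 5},
    "cmpd_B": {1, 9},
    "cmpd_C": {1, 2, 3, 4, 6},
}
print(most_similar({1, 2, 3, 4}, library, top_n=2))
```

In a read-across setting, the experimental data of the top-ranked neighbours would then be inspected to support or question the prediction for the query compound.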
Fig. 7 Chemical structures of triazole and imidazole fungicides referred to in the case study section