Philip Fradkin1,2, Adamo Young2,3, Lazar Atanackovic1,2, Brendan Frey1,2,3, Leo J Lee1,2, Bo Wang2,3,4,5.
Abstract
MOTIVATION: Molecular carcinogenicity is a preventable cause of cancer, but systematically identifying carcinogenic compounds, which involves performing experiments on animal models, is expensive, time-consuming and low-throughput. As a result, carcinogenicity information is limited, and building data-driven models with good prediction accuracy remains a major challenge.
Year: 2022 PMID: 35758812 PMCID: PMC9235510 DOI: 10.1093/bioinformatics/btac266
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. A graphical summary of the core CONCERTO components. In blue (top) is the GNN transformer, which takes in the graph representation of the molecule. In green (bottom) is the predictor, which feeds a fingerprint representation of the molecule, along with the GNN representation, into a multilayer perceptron. The two parts are jointly optimized with multi-round pre-training (orange, right) to generate carcinogenicity predictions. (A color version of this figure appears in the online version of this article.)
Summary statistics of chemical compound carcinogenicity datasets
| Dataset | Experiment type | No. of Experiments | (+) labels | (-) labels |
|---|---|---|---|---|
| CPDB | C | 6540 | 509 | 494 |
| CCRIS | C + M | 88056 | 2674 | 2099 |
| Hansen | M | N/A | 3403 | 2909 |
Note: Under Experiment type, C stands for carcinogenicity experiments and M stands for mutagenicity experiments. A significant fraction of compounds is present in multiple databases.
Model performances on CPDB and CCRIS
| Model | CPDB Pearson | CPDB MSE | CCRIS ROC AUC | CCRIS PR AUC |
|---|---|---|---|---|
| CONCERTO | | | | |
| Fingerprint MLP | 0.36 ± 0.07 | 0.81 ± 0.08 | | 0.64 ± 0.01 |
| GROVER MLP | 0.15 ± 0.16 | 0.83 ± 0.07 | | 0.69 ± 0.01 |
| CarcinoPred-EL Average RF | N/A | N/A | | 0.65 ± 0.02 |
| CarcinoPred-EL Pubchem RF | N/A | N/A | | 0.61 ± 0.01 |
| Fingerprint RF (CarcinoPred-EL alike) | 0.35 ± 0.04 | 1.17 ± 0.06 | | 0.64 ± 0.01 |
| Fingerprint AdaBoost (Limbu) | | 0.8 ± 0.08 | | 0.65 ± 0.01 |
Note: ROC and PR values accompany plots a and b in Figure 2 and are calculated only over compounds for which CarcinoPred-EL predictions are defined. CarcinoPred-EL was trained on CPDB, so we are unable to generate its predictions on CPDB without the confound of overfitting. Instead, we use the CarcinoPred-EL dataset to train a random forest similar to their proposed method and evaluate its performance on the CPDB dataset. Uncertainty is reported as the standard deviation over data re-sampled with replacement (bootstrapping) and is indicated after each value. For ROC AUC, P-values from a one-sided DeLong test against the full CONCERTO model (ROC AUC 0.73) are given in parentheses. Highlighted results are shown in bold.
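The bootstrap uncertainty described in the note can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' code: the helper names `roc_auc` and `bootstrap_auc`, the toy labels and scores, and the number of resamples are all assumptions made for the example.

```python
import random

def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_auc(labels, scores, n_resamples=200, seed=0):
    """Mean and standard deviation of ROC AUC over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample with a single class has no defined AUC
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    mean = sum(aucs) / len(aucs)
    var = sum((a - mean) ** 2 for a in aucs) / (len(aucs) - 1)
    return mean, var ** 0.5

# Hypothetical toy data: 1 = carcinogenic, 0 = non-carcinogenic.
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1, 0.8, 0.3, 0.6, 0.45]
mean_auc, std_auc = bootstrap_auc(toy_labels, toy_scores)
print(f"ROC AUC: {mean_auc:.2f} ± {std_auc:.2f}")
```

The pairwise AUC is quadratic in the number of compounds, which is acceptable for a sketch; a rank-based formula would be preferable on larger test sets.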
Fig. 2. (a, b) ROC and Precision–Recall plots demonstrating performance gains of CONCERTO (solid) over previous state of the art (dashed) on an external test dataset, CCRIS. (c–e) Correlation between log reciprocal TD50 values and model predictions on the CPDB test set. A clustered set of points at −1.62 carcinogenicity values indicates experiments in which no tumor growth was observed in animals.
Ablation experiments for CONCERTO models measuring the impact of GNN transformer, and multi-round mutagenicity pre-training
| Experiment | CPDB correlation | CCRIS ROC |
|---|---|---|
| Fingerprint + GROVER + multi-round mutagenicity pre-training | | |
| Fingerprint + GROVER + mutagenicity pre-training | | |
| Fingerprint + GROVER | | |
| Fingerprint | 0.17 ± 0.17 | 0.60 ± 0.10 |
Note: All architectures contain the MLP-fingerprint predictor. Results are averaged over 50 random seed runs. Standard deviation is computed over the random seed results. In parentheses are P values from a two-sided t-test comparing the performances from 50 models in the current cell to the cell below (***P < 0.001).
Fig. 3. Example of a counterfactual analysis. The x-axis shows Tanimoto distances (1 − Tanimoto similarity) between sampled molecules and the original molecule. The y-axis shows the predicted carcinogenicity relative to the test set carcinogenicity distribution. For each molecule (grey point), we visualize one positive (red point) and one negative (blue point) counterfactual example. The average within-dataset diversity, as measured by Tanimoto distances, is 0.88. Red lines indicate the prediction thresholds beyond which we consider a sampled molecule a counterfactual, while the blue line indicates the model prediction for the base molecule. (A color version of this figure appears in the online version of this article.)
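The distance on the x-axis of Fig. 3 is one minus the Tanimoto similarity. As a minimal sketch, assuming fingerprints are represented as sets of "on" bit indices (the bit sets below are made up, and the function names are illustrative rather than taken from the paper):

```python
def tanimoto_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return len(a & b) / union

def tanimoto_distance(a, b):
    """Tanimoto distance, defined as 1 - Tanimoto similarity."""
    return 1.0 - tanimoto_similarity(a, b)

# Hypothetical fingerprints for an original molecule and a sampled molecule.
original = {1, 2, 3, 4}
sampled = {3, 4, 5, 6}
print(tanimoto_distance(original, sampled))  # 2 shared bits out of 6 -> distance 2/3
```

A distance of 0 means identical fingerprints and 1 means no shared bits, so the reported within-dataset average of 0.88 corresponds to molecules that share few fingerprint bits.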
This table demonstrates the relative frequency of toxicophores in the test set and the corresponding positive and negative counterfactuals
| Toxicophore | SMARTS | % in negative counterfactuals | % in original molecules | % in positive counterfactuals |
|---|---|---|---|---|
| Nitroso | N=O | | 21.95 | 32.17 |
| Aliphatic halide | ClA, BrA, IA | 11.93 | 13.82 | 18.26 |
| Aromatic nitro | O=[N+]([O-])a | | 10.57 | 7.83 |
| Aromatic amine | [NH2]a | 3.67 | 4.88 | 10.43 |
| Three-membered heterocycle | C1C[NH]1, C1CO1, C1CS1 | 0.00 | 0.81 | 5.22 |
| Azo-type | N=N | 0.00 | 0.81 | 1.74 |
| Unsubstituted heteroatom-bonded heteroatom | N[NH2], N[OH], O[OH], O[NH2] | 0.0 | 0.0 | |
Note: SMARTS is an alternative molecular string representation allowing flexible tokens for aromatic and aliphatic atoms (Landrum, 2016). The odds ratio relative to the % of toxicophores found in the original molecules is indicated in parentheses. Significance is calculated using Fisher's exact test over ratios of substructure matches between counterfactual and original molecules. P values are adjusted using the Benjamini–Hochberg correction (*P < 0.05).
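The statistics named in the note (odds ratio, Fisher's exact test, Benjamini–Hochberg adjustment) can be illustrated with a short plain-Python sketch. The function names and all match counts below are hypothetical and chosen only for illustration; they are not the counts behind the table, and the paper's exact test configuration is not stated here, so a two-sided test is assumed.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c) if b * c else float("inf")

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test: sum the probabilities of all tables
    (with the same margins) that are no more likely than the observed one."""
    r1, r2, c1 = a + b, c + d, a + c
    lo, hi = max(0, c1 - r2), min(r1, c1)

    def weight(x):
        # Unnormalised hypergeometric probability of x matches in row 1.
        return comb(r1, x) * comb(r2, c1 - x)

    observed = weight(a)
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(r1 + r2, c1)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts: (matches, non-matches) in counterfactuals vs. original molecules.
tables = {
    "substructure A": (30, 70, 15, 85),
    "substructure B": (5, 95, 4, 96),
    "substructure C": (12, 88, 2, 98),
}
raw = [fisher_exact_two_sided(*t) for t in tables.values()]
for (name, t), p_adj in zip(tables.items(), benjamini_hochberg(raw)):
    flag = "*" if p_adj < 0.05 else ""
    print(f"{name}: odds ratio {odds_ratio(*t):.2f}, adjusted P {p_adj:.3g}{flag}")
```

Comparing integer hypergeometric weights before normalising avoids floating-point ties when deciding which tables count as "at least as extreme".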
Fig. 4. Dataset distances, generated by calculating the maximum mean discrepancy (MMD) over Tanimoto scores. For carcinogenicity, there are two datasets that we further subdivide into positive and negative labels, creating four partitions. For each pair of partitions, we calculate the corresponding distance and visualize the results using a heatmap.
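A rough sketch of an MMD computation over Tanimoto scores is given below. It assumes the Tanimoto similarity is used directly as the kernel and uses the simple biased estimator of squared MMD; the caption does not specify either choice, so both are assumptions, as are the function names and the toy fingerprint sets.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def mmd_squared(xs, ys, kernel=tanimoto):
    """Biased estimate of squared MMD: E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    def mean_kernel(group_a, group_b):
        total = sum(kernel(a, b) for a in group_a for b in group_b)
        return total / (len(group_a) * len(group_b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical partitions, e.g. positives of one dataset vs. negatives of another.
partition_a = [{1, 2, 3}, {2, 3, 4}, {1, 3, 5}]
partition_b = [{7, 8, 9}, {8, 9, 10}, {3, 8, 11}]
print(mmd_squared(partition_a, partition_a))  # identical partitions give 0
print(mmd_squared(partition_a, partition_b))  # dissimilar partitions give a larger value
```

Evaluating `mmd_squared` for every pair of the four partitions would yield the kind of symmetric matrix that a heatmap can display.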