| Literature DB >> 35566372 |
Lama Moukheiber1, William Mangione2, Mira Moukheiber2, Saeed Maleki3, Zackary Falls2, Mingchen Gao1, Ram Samudrala2.
Abstract
Humans are exposed to numerous compounds daily, some of which have adverse effects on health. Computational approaches for modeling toxicological data in conjunction with machine learning algorithms have gained popularity over the last few years. Machine learning approaches have been used to predict toxicity-related biological activities using chemical structure descriptors. However, toxicity-related proteomic features have not been fully investigated. In this study, we construct a computational pipeline using machine learning models for predicting the most important protein features responsible for the toxicity of compounds taken from the Tox21 dataset that is implemented within the multiscale Computational Analysis of Novel Drug Opportunities (CANDO) therapeutic discovery platform. Tox21 is a highly imbalanced dataset consisting of twelve in vitro assays, seven from the nuclear receptor (NR) signaling pathway and five from the stress response (SR) pathway, for more than 10,000 compounds. For the machine learning model, we employed a random forest with the combination of Synthetic Minority Oversampling Technique (SMOTE) and the Edited Nearest Neighbor (ENN) method (SMOTE+ENN), which is a resampling method to balance the activity class distribution. Within the NR and SR pathways, the activity of the aryl hydrocarbon receptor (NR-AhR) and the mitochondrial membrane potential (SR-MMP) were two of the top-performing twelve toxicity endpoints with AUCROCs of 0.90 and 0.92, respectively. The top extracted features for evaluating compound toxicity were analyzed for enrichment to highlight the implicated biological pathways and proteins. We validated our enrichment results for the activity of the AhR using a thorough literature search. Our case study showed that the selected enriched pathways and proteins from our computational pipeline are not only correlated with AhR toxicity but also form a cascading upstream/downstream arrangement. Our work elucidates significant relationships between protein and compound interactions computed using CANDO and the associated biological pathways to which the proteins belong for twelve toxicity endpoints. This novel study uses machine learning not only to predict and understand toxicity but also elucidates therapeutic mechanisms at a proteomic level for a variety of toxicity endpoints.Entities:
Keywords: drug behavior; enrichment analysis; feature selection; high-throughput screening; machine learning; proteomic signature; random forest; structure–activity relationships; toxicity
Mesh:
Substances:
Year: 2022 PMID: 35566372 PMCID: PMC9099959 DOI: 10.3390/molecules27093021
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.927
Figure 1Overview of study workflow and pipeline for toxicity feature identification. As part of the data pre-processing step, the twelve Tox21 assay datasets containing the SMILES strings and activity class of the compounds were merged. The compounds were normalized and standardized using the MolVs library [50], built on RDKit where compound duplicates with ambiguous activity labels (i.e., equally active and inactive labels for the same compound) were removed. Model features were generated using a protein-compound interaction matrix via the CANDO platform. The data containing the features and the class activity for each compound were generated for each of the twelve assays and split into training and test sets. The SMOTE+ENN algorithm was applied to oversample the minority class and obtain an augmented training subset used to train our random forest classifier. The parameters for the random forest classifier were selected using tenfold cross-validation to attain optimal model performance. The model was evaluated on the unseen test data with metrics such as F1-score, recall, precision, specificity, balanced accuracy, AUCROC, and AUPRC to evaluate its performance. The model was then used to obtain the top 100 important features (protein structures); proteins were associated with pathways annotated in Reactome. The overrepresented pathways in the top 100 proteins via enrichment analysis were identified. The enriched pathways for the NR-AhR assay were analyzed as a case study, which illustrated that our unique proteomic feature selection pipeline allows for a mechanistic understanding of compound toxicity.
Details of compounds and their activity in the Tox21 dataset. Details are given for the pre-proccessed Tox21 Data Challenge training and test sets, including the assay identifier, target, and total number of compounds assayed, as well as the partitioning of the training and test sets. We compute the imbalanced ratio (IR) for each assay, which is the ratio of the inactive non-toxic compounds to the active toxic compounds. An imbalanced ratio value closer to one signifies a fairly balanced class activity, and a value much greater than one signifies a very imbalanced class activity. The imbalanced ratio values indicate that the Tox21 dataset is highly imbalanced and that model performance can be improved by taking this into account.
| In Vitro qHTS | Target/Assay | Number | Training Set | Test Set | ||||
|---|---|---|---|---|---|---|---|---|
| Inactive | Active | IR | Inactive | Active | IR | |||
| NR-AhR | Aryl hydrocarbon | 7103 | 5777 | 734 | 7.87 | 521 | 71 | 7.34 |
| NR-ER-LBD | Estrogen receptor | 7509 | 6643 | 282 | 23.56 | 564 | 20 | 28.20 |
| NR-ER | Estrogen receptor | 6630 | 5474 | 651 | 8.41 | 456 | 49 | 9.31 |
| NR-Aromatase | Aromatase | 6286 | 5496 | 274 | 20.06 | 479 | 37 | 12.94 |
| NR-PPAR- | Peroxisome | 7039 | 6283 | 167 | 37.62 | 559 | 30 | 18.63 |
| NR-AR | Androgen receptor | 7783 | 6958 | 252 | 27.61 | 561 | 12 | 46.75 |
| NR-AR-LBD | Androgen receptor | 7298 | 6521 | 211 | 30.90 | 558 | 8 | 69.75 |
| SR-MMP | Mitochondrial | 6316 | 4899 | 888 | 5.52 | 474 | 55 | 8.62 |
| SR-ARE | Nuclear factor | 6339 | 4919 | 881 | 5.58 | 450 | 89 | 5.06 |
| SR-ATAD5 | Genotoxicity indicated | 7646 | 6787 | 256 | 26.51 | 569 | 34 | 16.73 |
| SR-p53 | DNA damage | 7358 | 6351 | 409 | 15.53 | 560 | 38 | 14.74 |
| SR-HSE | Heat shock factor | 7040 | 6144 | 305 | 20.14 | 574 | 17 | 33.76 |
Model evaluation metrics computed for the twelve Tox21 assay datasets. For each assay, the model’s performance using F1-score, precision, recall, specificity, Mathews correlation coefficient (MCC), balanced accuracy (BA), AUROC, and AUPRC metrics is given. BA is the average of the recall and specificity and is a useful metric when evaluating imbalanced data. The overall performance of our pipeline is promising, particularly in terms of BA, depending on assays and metrics considered.
| Assays | F1 | Precision | Recall | AUCROC | AUPRC | BA | MCC | Specificity | Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| NR-AhR | 0.471 | 0.318 | 0.901 | 0.896 | 0.560 | 0.819 | 0.438 | 0.737 | 0.757 |
| NR-ER-LBD | 0.342 | 0.333 | 0.350 | 0.810 | 0.272 | 0.663 | 0.318 | 0.975 | 0.954 |
| NR-ER | 0.420 | 0.301 | 0.694 | 0.806 | 0.414 | 0.760 | 0.370 | 0.827 | 0.814 |
| NR-Aromatase | 0.317 | 0.250 | 0.432 | 0.795 | 0.282 | 0.666 | 0.260 | 0.900 | 0.866 |
| NR-PPAR- | 0.286 | 0.308 | 0.267 | 0.745 | 0.241 | 0.617 | 0.251 | 0.968 | 0.932 |
| NR-AR | 0.261 | 0.273 | 0.250 | 0.706 | 0.196 | 0.618 | 0.178 | 0.988 | 0.970 |
| NR-AR-LBD | 0.000 | 0.000 | 0.000 | 0.618 | 0.036 | 0.493 | −0.014 | 0.986 | 0.972 |
| SR-MMP | 0.488 | 0.331 | 0.927 | 0.916 | 0.597 | 0.855 | 0.478 | 0.783 | 0.798 |
| SR-ARE | 0.425 | 0.305 | 0.697 | 0.757 | 0.403 | 0.692 | 0.294 | 0.687 | 0.688 |
| SR-ATAD5 | 0.325 | 0.283 | 0.382 | 0.744 | 0.230 | 0.662 | 0.282 | 0.942 | 0.910 |
| SR-p53 | 0.235 | 0.159 | 0.447 | 0.830 | 0.198 | 0.643 | 0.182 | 0.839 | 0.814 |
| SR-HSE | 0.286 | 0.308 | 0.267 | 0.759 | 0.240 | 0.617 | 0.251 | 0.968 | 0.932 |
Figure 2Comparison of model performance across the twelve Tox21 assays. The performance of the twelve Tox21 assays is evaluated using AUROC (top) and balanced accuracy (bottom). Although AUCROC is widely used as a binary classifier evaluation metric, it can be misleading for imbalanced classification with few examples of the minority class. We compared our results in blue to the three other studies which utilized a random forest-based classifier. (The microsome study does not report both AUCROC and BA for the NR-AR, NR-AR-LBD, and SR-MMP assays and the BA for NR-PPAR-.) Our model performs comparably to previous methods yet allows for the extraction of important protein features implicated in each toxicity endpoint.
Important mechanistic pathways and their proteins for the toxicity endpoint of NR-AhR. The name of the pathway, the number of proteins present in the pathway according to Reactome, the proteins in the pathway selected by the model as important for predicting AhR toxicity, and the p-value from the enrichment analysis are shown. The p-value is derived using the hypergeometric distribution based on how many total important proteins were selected by the model, how many total proteins are present in the pathway, and the total number of proteins in the human proteome. The value shown is the probability of selecting at least that number of proteins present in the pathway. These results indicate that our pipeline is capable of extracting higher level biological explanations associated with AhR toxicity that have been validated via the literature.
| Pathway | Total Proteins | Selected Proteins | |
|---|---|---|---|
| Nonhomologous End-Joining (NHEJ) | 52 | RNF8,UBE2N, | 1.36 × 10 |
| Recruitment and ATM-mediated phosphorylation of repair | 59 | RNF8,UBE2N, | 2.22 × 10 |
| TRAF6 mediated NF- | 24 | TRAF2,TRAF6 | 2.35 × 10 |
| DNA Double Strand Break Response | 60 | RNF8,UBE2N, | 2.36 × 10 |
| TRAF6 mediated IRF7 activation | 28 | TRAF2,TRAF6 | 3.73 × 10 |
| Neurofascin interactions | 7 | NRCAM,CNTN1 | 5.28 × 10 |
| DDX58/IFIH1-mediated induction of interferon-alpha/beta | 77 | TRAF2,RNF125, | 6.04 × 10 |
| RUNX3 regulates YAP1-mediated transcription | 8 | TEAD1,TEAD4 | 7.01 × 10 |
| SUMOylation of transcription cofactors | 42 | RNF2,UHRF2,PIAS3 | 1.22 × 10 |
| IRAK1 recruits IKK complex | 14 | TRAF6,UBE2N | 2.21 × 10 |
| IRAK1 recruits IKK complex upon TLR7/8 or 9 stimulation | 14 | TRAF6,UBE2N | 2.21 × 10 |
| YAP1- and WWTR1 (TAZ)-stimulated gene expression | 14 | TEAD1,TEAD4 | 2.21 × 10 |
| TRAF6 mediated IRF7 activation in TLR7/8 or 9 signaling | 14 | TRAF6,UBE2N | 2.21 × 10 |
| TICAM1, RIP1-mediated IKK complex recruitment | 19 | TRAF6,UBE2N | 4.05 × 10 |
| Signal transduction by L1 | 20 | NRP1,NCAM1 | 4.48 × 10 |
| G2/M DNA damage checkpoint | 78 | RNF8,UBE2N,BRCA1, | 4.64 × 10 |
| Regulation of FZD by ubiquitination | 21 | LRP6,LGR5 | 4.92 × 10 |
| IKK complex recruitment mediated by RIP1 | 22 | TRAF6,UBE2N | 5.39 × 10 |
| JNK (c-Jun kinases) phosphorylation and | 22 | TRAF6,UBE2N | 5.39 × 10 |
| Processing of DNA double-strand break ends | 81 | RNF8,UBE2N,BRCA1, | 5.55 × 10 |
| Activated TAK1 mediates p38 MAPK activation | 23 | TRAF6,UBE2N | 5.87 × 10 |
| Formation of Incision Complex in GG-NER | 43 | UBE2N,PIAS3, | 6.52 × 10 |
| Recognition of DNA damage by | 31 | RBX1,RPA1 | 1.03 × 10 |
| TAK1 activates NFkB by phosphorylation | 32 | TRAF6,UBE2N | 1.10 × 10 |
| DNA strand elongation | 32 | GINS2,RPA1 | 1.10 × 10 |
| Sialic acid metabolism | 33 | GLB1,NANP | 1.16 × 10 |
| Transcriptional Regulation by E2F6 | 34 | RNF2,BRCA1 | 1.23 × 10 |
| Negative regulators of DDX58/IFIH1 signaling | 34 | RNF125,DDX58 | 1.23 × 10 |
| NOD1/2 Signaling Pathway | 35 | TRAF6,UBE2N | 1.30 × 10 |
| RUNX1 interacts with co-factors whose | 36 | RNF2,PCGF5 | 1.37 × 10 |
| HDR through Single Strand Annealing (SSA) | 37 | BRCA1,RPA1 | 1.44 × 10 |
| Ovarian tumor domain proteases | 38 | TRAF6,DDX58 | 1.51 × 10 |
| Presynaptic phase of homologous DNA | 39 | BRCA1,RPA1 | 1.58 × 10 |
| Formation of Fibrin Clot (Clotting Cascade) | 39 | PROCR,GP1BB | 1.58 × 10 |
Figure 3Overview of the NR-AhR signaling pathway. AhR is an inactive cytosolic transcription factor bound to several co-chaperones. When a ligand passively diffuses through the cell membrane and binds to AhR, the ligand-receptor complex translocates into the nucleus, and the chaperones dissociate. Once in the nucleus, the AhR dimerizes with the AhR nuclear translocator (ARNT) forming an active heterodimer. The activated heterodimer complex interacts with DNA directly or indirectly by binding to recognition sequences located in the promoter regions of target genes, such as the xenobiotic response elements (XRE). BRCA1 modulates the expression of genes involved in xenobiotic stress responses [62], and was selected by our pipeline as an important protein feature in AhR toxicity. The binding of the heterodimer complex to the DNA activates the transcription of genes leading to proteins that affect inflammation, the cell cycle, and immunological response. The toxicity of the ligand-activated AhR can also be mediated by non-genomic action through enzymatic activation and the triggering of other pathways, such as the NF-B pathway, which involves the TRAF6 protein, another important protein selected by our pipeline. The interconnection of different pathways as shown exemplifies how our pipeline can decipher different mechanisms for AhR toxicity induced by the toxic compounds.
Figure 4Breast cancer type 1 susceptibility protein (BRCA1) activity in response to DNA damage. Double-stranded breaks in DNA activate the ataxia-telangiectasia mutated (ATM) kinase and Rad3-related protein (ATR) that subsequently phosphorylate checkpoint kinase 2 (CHK2) and checkpoint kinase 1 (CHK1). The ATR/ATM-mediated phosphorylation pathway phosphorylates the BRCA1 protein and activates proteins involved in DNA damage repair and apoptosis. The binding of toxic compounds or AhR ligands leads to the suppression of the BRCA1 promoter activity, ultimately leading to increased cellular toxicity via DNA repair dysfunction and cell cycle checkpoint disregulation. Our pipeline selected BRCA1 as important for predicting AhR toxicity, along with other proteins such as RNF8, UBE2N, RAP80, NSD2, and RPA1, implying that compounds are known to induce this phenotype possibly via the modulation of proteins directly involved in DNA repair and cell cycle regulation.
Figure 5Interconnectivity of the pathways leading to the activation of the NF- The tumor necrosis factor receptor 1 (TNFR1) plays an essential role in pro-inflammatory activities. Upon TNFR1’s stimulation TNF receptor-associated factor 2 (TRAF2) is recruited along with receptor-interacting serine/threonine-protein kinase 1 (RIP1), which in turn activates transforming growth factor -activated kinase 1 (TAK1). TAK1 can also be activated through the interleukin-1 receptor/toll-like receptor (IL-1R/TLR). Once IL-1R/TLR is activated, it triggers the activation of the TNF receptor-associated factor 6 (TRAF6) downstream. TRAF6 then combines with ubiquitin-conjugating enzymes Ubc13/UBE2N and ultimately activates TAK1. TAK1 then activates the nuclear factor-B (NF-B) signaling pathway, which recruits the NF-B transcription factor, leading to the production of cytokines. TAK1 can also induce a cascade of mitogen-activated kinases (MAPKs), including the c-Jun kinases (JNKs) or the p38 MAPK. This activates the activator protein 1 (AP-1) transcription factor leading to the production of cytokines. In addition to activating TAK1, TRAF6 can also activate downstream interferon regulatory factor 7 (IRF7) via the TLR7/8/9-MyD88 pathway, leading to the production of type 1 interferon. TRAF6, TRAF2, and UBE2N were identified by our pipeline as important for predicting AhR toxicity, implying that these proteins may be modulated by compounds known to induce this toxic phenotype via the depicted pro-inflammatory pathways.