Daniel J Cooper, Stephan Schürer.
Abstract
The Toxicology in the 21st Century (Tox21) project seeks to develop and test methods for high-throughput examination of the effects certain chemical compounds have on biological systems. Although primary and toxicity assay data were readily available for multiple reporter gene-modified cell lines, extensive annotation and curation were required to make these datasets more FAIR (Findable, Accessible, Interoperable, and Reusable). In this study, we fully annotated the published Tox21 data with relevant and accepted controlled vocabularies. After removing unreliable data points, we aggregated the results and created three sets of signatures reflecting activity in the reporter gene assays, cytotoxicity, and selective reporter gene activity, respectively. We benchmarked these signatures using the chemical structures of the tested compounds and obtained generally high receiver operating characteristic (ROC) scores, suggesting good quality and utility of these signatures and the underlying data. We analyzed the results to identify promiscuous individual compounds and chemotypes for the three signature categories and interpreted the results to illustrate the utility and re-usability of the datasets. With this study, we aimed to demonstrate the importance of data standards in reporting screening results and of high-quality annotations in enabling re-use and interpretation of these data. To improve the data with respect to all FAIR criteria, all assay annotations, cleaned and aggregated datasets, and signatures were made available as standardized dataset packages (Aggregated Tox21 bioactivity data, 2019).
Keywords: FAIR data; Tox21; benchmarking; data standards; high-throughput screening; metadata; ontologies; signatures
Year: 2019 PMID: 31018579 PMCID: PMC6515292 DOI: 10.3390/molecules24081604
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Data processing workflow. (A) Toxicology in the 21st Century (Tox21) project data were downloaded from PubChem and combined into a single file for analysis. (B) Data were filtered based on reported sample purity and aggregated by unique compound. (C) Three activity categories were benchmarked based on the compounds' chemical structures using Laplacian-corrected Naïve Bayesian classification. Label randomization led to random predictions, as expected. (D) Validated, filtered, and aggregated datasets were used to analyze single-compound promiscuity and, after clustering by chemical structure, scaffold promiscuity.
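Step (B) of the workflow can be illustrated with a minimal sketch. The record layout, the purity flag, and the majority-vote aggregation rule below are assumptions made for illustration; the source does not specify how replicates were combined, and this is not the Tox21 or PubChem schema.

```python
from collections import defaultdict

# Hypothetical records: (compound_id, assay, purity_ok, active). Field names
# and the purity flag are illustrative, not the Tox21/PubChem schema.
records = [
    ("C1", "assay_a", True, 1),
    ("C1", "assay_a", True, 1),
    ("C1", "assay_a", False, 0),   # dropped: failed the purity filter
    ("C2", "assay_a", True, 0),
    ("C2", "assay_b", True, 1),
    ("C2", "assay_b", True, 0),
]

def filter_and_aggregate(rows):
    """Drop rows failing the purity filter, then collapse replicates.

    Aggregation rule (an assumption): a compound/assay pair is called
    active when at least half of its remaining replicates are active.
    """
    votes = defaultdict(list)
    for compound, assay, purity_ok, active in rows:
        if purity_ok:
            votes[(compound, assay)].append(active)
    return {key: int(2 * sum(v) >= len(v)) for key, v in votes.items()}

matrix = filter_and_aggregate(records)
```

The result is one activity call per unique compound and assay, which is the shape the later benchmarking and promiscuity steps operate on.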
Data Matrix Statistics.
| Set Number | Compounds Tested | Number of Assays * | Total Data Points |
|---|---|---|---|
| | 5157 | 22 | 113,454 |
| | 3354 | 42 | 140,868 |
| | 63 | 4 | 252 |
| | 5157 | 68 | 254,574 |
| Data pairs ** | | | 127,287 |
* Number of assays represents total of reporter assays + toxicity assays, based on compounds remaining after filtration; ** Data pairs represents the number of data points aggregated to reporter/toxicity assay pairings which are shown in Supplementary Table S3.
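The figures in the table are internally consistent, which a short check makes explicit: each per-set total equals compounds tested times number of assays. Reading the 127,287 data pairs as half of the overall total is an inference from the numbers and the footnote, not something the source states.

```python
# (compounds tested, number of assays, total data points) for the three
# per-set rows of the table, in the order they appear.
rows = [(5157, 22, 113_454), (3354, 42, 140_868), (63, 4, 252)]

for compounds, assays, points in rows:
    # A full compound-by-assay matrix has compounds * assays cells.
    assert compounds * assays == points

total_assays = sum(assays for _, assays, _ in rows)
total_points = sum(points for _, _, points in rows)

# Inference: each pair combines one reporter and one toxicity data point.
data_pairs = total_points // 2
```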
Figure 2. Benchmarking Machine Learning Results. (A) Compounds were evaluated for active versus inactive class in three distinct categories (see text for details) based on their chemical structures. Receiver operating characteristic (ROC) scores were calculated using leave-one-out cross-validation. Results for randomized labels are shown in green. (B) Box plots indicate the overall ROC score distribution for each category; arithmetic means are indicated by dashed lines; error bars = standard deviation.
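The benchmarking idea can be sketched as leave-one-out scoring followed by a ROC score, with shuffled labels as the negative control. This sketch uses a plain Bernoulli naive Bayes with add-one smoothing as a stand-in; it is not the Laplacian-corrected classifier the authors used, and the toy fingerprints are hypothetical.

```python
import math
import random

def roc_auc(labels, scores):
    """Rank-based AUC: probability a random active outscores a random inactive."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def nb_score(train_x, train_y, x):
    """Log-odds of 'active' from a Bernoulli naive Bayes with add-one smoothing."""
    score = 0.0
    for cls, sign in ((1, 1.0), (0, -1.0)):
        rows = [r for r, y in zip(train_x, train_y) if y == cls]
        score += sign * math.log((len(rows) + 1) / (len(train_y) + 2))
        for j, bit in enumerate(x):
            p_on = (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
            score += sign * math.log(p_on if bit else 1.0 - p_on)
    return score

def loo_auc(features, labels):
    """Leave-one-out: score each compound with a model trained on the rest."""
    scores = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        scores.append(nb_score(train_x, train_y, features[i]))
    return roc_auc(labels, scores)

# Toy binary fingerprints (hypothetical): bit 0 tracks activity, bits 1-2 are noise.
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
features = [[y, rng.randint(0, 1), rng.randint(0, 1)] for y in labels]

real_auc = loo_auc(features, labels)

shuffled = labels[:]
rng.shuffle(shuffled)            # label randomization as a negative control
random_auc = loo_auc(features, shuffled)
```

Comparing `real_auc` with `random_auc` mirrors the comparison in panel (A): a signature that carries structural signal should score well above the randomized-label control.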
Figure 3. Promiscuity-based classification of individual compounds. (A) Reporter activity PIs plotted relative to toxicity activity PIs classify compounds as selective versus promiscuous (measured by relative reporter activity PI) and inert versus cytotoxic (measured by relative toxicity activity PI). Selected members of the selective cytotoxic (blue boxes), promiscuous cytotoxic (red boxes), and promiscuous inert (green box) groups were examined in more depth (Table 2). (B) Reporter selective PIs plotted relative to toxicity activity PIs. Examples of CIDs with high reporter selective PIs and low toxicity PIs (green box) were examined in depth (Table 2). Data points are colored by total activity PI z-scores, indicative of general promiscuity, and were jittered on the x- and y-axes for clarity because PI values are non-continuous.
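The quadrant classification can be sketched as follows. The PI definition (fraction of assays in which a compound is active), the 0.5 cutoff, and the compound names are all assumptions for illustration; this excerpt does not give the authors' exact formula or thresholds.

```python
from statistics import mean, pstdev

def promiscuity_index(calls):
    """Fraction of assays in which a compound was active (assumed definition)."""
    return sum(calls) / len(calls)

def z_scores(values):
    """Standardize values against their own mean and population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def classify(reporter_pi, toxicity_pi, cutoff=0.5):
    """Quadrant label; the 0.5 cutoff is a hypothetical threshold."""
    breadth = "promiscuous" if reporter_pi >= cutoff else "selective"
    tox = "cytotoxic" if toxicity_pi >= cutoff else "inert"
    return f"{breadth} {tox}"

# Hypothetical activity calls (1 = active) per compound: (reporter, toxicity).
compounds = {
    "cmpd_a": ([1, 1, 1, 0], [1, 1]),
    "cmpd_b": ([0, 0, 1, 0], [0, 0]),
    "cmpd_c": ([1, 1, 1, 1], [0, 0]),
}

groups = {}
total_pi = []
for name, (reporter, toxicity) in compounds.items():
    groups[name] = classify(promiscuity_index(reporter), promiscuity_index(toxicity))
    total_pi.append(promiscuity_index(reporter + toxicity))

# Total-activity z-scores, analogous to the coloring of the data points.
total_z = dict(zip(compounds, z_scores(total_pi)))
```

Because a PI computed over a small number of assays can take only a few distinct values, many compounds share identical coordinates, which is why the plotted points were jittered.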
Examples of Promiscuous Compounds.
[Compound structure images not recovered. The table showed six groups of three sample compounds each; the first group is headed "Compounds with High Toxicity z-Scores".]
* CID 11219835 identified specifically by Auld et al. [33] as a potent luciferase inhibitor. PI = promiscuity indices.
Figure 4. Promiscuity-based classification of chemotype clusters. As in Figure 3 for individual compounds, chemotype clusters were visualized based on reporter activity PIs and toxicity activity PIs. Each cluster is represented by one data point, sized by the number of members within the cluster and colored by total activity PI z-score. Most clusters were specific and inert (lower left corner). Two clusters, designated A and B, were examined further because of cluster size and high reporter/toxicity PI ratio (A) and molecular scaffold similarity to luciferase inhibitors (B).