
QSAR ligand dataset for modelling mutagenicity, genotoxicity, and rodent carcinogenicity.

Davy Guan, Kevin Fan, Ian Spence, Slade Matthews.

Abstract

Five datasets were constructed from ligand and bioassay result data from the literature. They comprise results from the Ames mutagenicity assay, the GreenScreen GADD-45a-GFP assay, the Syrian Hamster Embryo (SHE) assay, and the 2-year rat carcinogenicity assay (two datasets). Together these datasets provide information about chemical mutagenicity, genotoxicity, and carcinogenicity.


Year:  2018        PMID: 29516034      PMCID: PMC5835004          DOI: 10.1016/j.dib.2018.01.077

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the data

This article contains the largest public collection to date of ligands and results for the GreenScreen GADD-45a-GFP and Syrian Hamster Embryonic Cell Transformation assays, collated from previous literature.

A benchmark dataset of pharmaceutically relevant ligands for use in rat carcinogenicity QSAR models is presented and compared with ligands from regulatory domains.

Physicochemical descriptors were calculated from the SMILES structures and selected for QSAR model performance.

Data

The creation of a QSAR model for the 2-year rodent carcinogenicity bioassay is highly desirable because this bioassay is the gold standard for assessing potential chemical carcinogenicity. However, previous modelling efforts have been hampered by data availability and reliability issues that stem from bioassay limitations such as low throughput, high cost, and modest reproducibility between laboratories and rodent species. The in vivo carcinogenicity datasets in this article contain only rat carcinogenicity outcomes, because previous literature found that the rat bioassay gives more reliable endpoints than the mouse bioassay. This article presents two rat carcinogenicity datasets, one from the regulatory toxicology domain and one from the pharmaceutical safety domain.

Genotoxicity arises when chemicals act through genomic mechanisms of toxicity, and it has been associated with potential carcinogenicity. Many in vitro bioassays exist for this endpoint type, and their libraries of screened molecules are larger than those available from in vivo rodent carcinogenicity bioassays. QSAR models that can use these data in combination with rodent carcinogenicity data may address the limited applicability domain of the in vivo data. Results for the in vitro GreenScreen GADD-45a-GFP and Syrian Hamster Embryonic bioassays were exhaustively collated from the literature. Previous literature found concordance between these bioassays and in vivo rodent carcinogenicity outcomes. The Ames Bacterial Mutagenicity Benchmark Dataset is also included for comparison.

Experimental design, materials and methods

Dataset preparation

ISSCAN: database of in vivo rat carcinogenicity results for 854 chemicals from [1].

PHARM: in vivo rat carcinogenicity results on pharmaceutical chemicals from [2].

GreenScreen: 1415 GADD-45a-GFP assay results from [3], [4], [5], [6], [7], [8], [9], [10].

Syrian Hamster Embryonic: data on 356 chemicals (the n reported in Table 1) extracted from [11], [12].

Ames: 6512 Ames results from [13].

Dataset curation

SMILES structures were generated from CAS Numbers or chemical names using ChemAxon JChem for Office. The structures were then curated with ChemAxon Standardizer, which removed salts and solvents and aromatized the remaining structures.
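
As a rough illustration of the salt and solvent removal step, the sketch below keeps the largest fragment of a dot-separated SMILES string. It is not ChemAxon Standardizer and performs no aromatization; the helper names `heavy_atom_count` and `strip_salts` and the atom-counting heuristic are assumptions made for this example only.

```python
import re

# One atom token in a SMILES string: a bracket atom such as [Na+], a
# two-letter organic-subset element (Cl, Br), or a one-letter atom
# (aliphatic upper case or aromatic lower case).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

# Explicit hydrogens written as bracket atoms, e.g. [H], [H+], [2H].
EXPLICIT_H = re.compile(r"\[\d*H[+-]?\d*\]")


def heavy_atom_count(fragment: str) -> int:
    """Approximate the number of non-hydrogen atoms in a SMILES fragment."""
    tokens = ATOM_TOKEN.findall(fragment)
    return sum(1 for t in tokens if not EXPLICIT_H.fullmatch(t))


def strip_salts(smiles: str) -> str:
    """Return the fragment with the most heavy atoms (assumed parent structure)."""
    fragments = [f for f in smiles.split(".") if f]
    return max(fragments, key=heavy_atom_count)


print(strip_salts("CC(=O)[O-].[Na+]"))  # sodium acetate
print(strip_salts("c1ccccc1N.Cl"))      # aniline hydrochloride
```

The largest-fragment rule is a heuristic: it would discard a genuinely active small counter-ion and cannot tell a solvent from a co-crystallised ligand, so a dedicated standardization tool remains the safer choice for real curation.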

Descriptor selection

The CfsSubsetEval algorithm [14] selected subsets of structural descriptors (generated by PaDEL Descriptor [15]) for each dataset.
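
The idea behind correlation-based feature subset selection can be sketched in a few lines. The example below scores a descriptor subset with the CFS merit heuristic, k·r_cf / sqrt(k + k(k−1)·r_ff), and grows the subset greedily. It is a simplified stand-in, not Weka's CfsSubsetEval: absolute Pearson correlation is assumed as the correlation measure, the search is plain forward selection, and the descriptor names and values are invented toy data.

```python
from math import sqrt


def pearson(x, y):
    """Absolute Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return abs(sxy / sqrt(sxx * syy))


def cfs_merit(subset, columns, target):
    """CFS merit of a subset: high class correlation, low inter-correlation."""
    k = len(subset)
    r_cf = sum(pearson(columns[f], target) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(pearson(columns[a], columns[b]) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def forward_select(columns, target):
    """Add the descriptor that raises the merit most; stop when none does."""
    selected, best = [], 0.0
    while True:
        scored = [(cfs_merit(selected + [f], columns, target), f)
                  for f in columns if f not in selected]
        if not scored:
            return selected
        merit, feature = max(scored)
        if merit <= best:
            return selected
        selected.append(feature)
        best = merit


# Toy data: d1 tracks the label, d2 duplicates d1, d3 is unrelated to the label.
columns = {
    "d1": [0.1, 0.9, 0.2, 0.8, 0.15, 0.95],
    "d2": [0.1, 0.9, 0.2, 0.8, 0.15, 0.95],
    "d3": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],
}
labels = [0, 1, 0, 1, 0, 1]
print(forward_select(columns, labels))
```

On this toy data the unrelated descriptor `d3` is never selected, and adding the duplicate `d2` to `d1` leaves the merit essentially unchanged, which is the redundancy penalty the heuristic is designed to apply.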

Applicability domain quantification

The applicability domain of each dataset relative to the PHARM dataset was quantified using three measures: leverage, Euclidean distance from the centroid, and a variable knn-based distance [16].
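
Two of these distance-based checks can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the procedure used for the article: the cut-offs (largest training distance from the centroid, largest leave-one-out kNN distance) are choices made for this sketch, the kNN check uses a single fixed k rather than the variable-k scheme of [16], and leverage is left out because it needs a matrix inverse.

```python
from math import dist  # available from Python 3.8
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of descriptor vectors."""
    return [mean(col) for col in zip(*rows)]


def knn_mean_distance(query, rows, k):
    """Mean Euclidean distance from `query` to its k nearest rows."""
    return mean(sorted(dist(query, row) for row in rows)[:k])


def fit_thresholds(train, k):
    """Derive the assumed cut-offs from the training descriptors."""
    c = centroid(train)
    centroid_cut = max(dist(row, c) for row in train)
    # Leave each training row out when measuring its own kNN distance.
    knn_cut = max(
        knn_mean_distance(row, train[:i] + train[i + 1:], k)
        for i, row in enumerate(train)
    )
    return c, centroid_cut, knn_cut


def inside_ad(query, train, k=3):
    """Return (inside by centroid distance, inside by kNN distance)."""
    c, centroid_cut, knn_cut = fit_thresholds(train, k)
    return (dist(query, c) <= centroid_cut,
            knn_mean_distance(query, train, k) <= knn_cut)


# Toy two-descriptor training set and two query compounds.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
print(inside_ad([0.4, 0.6], train))
print(inside_ad([5.0, 5.0], train))
```

On the toy data the first query falls inside both domains and the second falls outside both. In practice the descriptors would be scaled first, since Euclidean distances are dominated by whichever descriptor has the largest numeric range.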

QSAR model Attribute Importance Scores

Attribute Importance Scores were calculated from each RandomForest QSAR model (Fig. 1, Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7).
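
The article does not state how the scores are computed. One common definition for tree ensembles, assumed here, is the average impurity decrease over all nodes that split on a descriptor, reported together with the number of such nodes. The sketch below shows that bookkeeping with Gini impurity on hand-written split records; the split counts are hypothetical and only the descriptor names are borrowed from the tables.

```python
from collections import defaultdict


def gini(pos, neg):
    """Gini impurity of a node holding `pos` positive and `neg` negative samples."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)


def impurity_decrease(parent, left, right):
    """Impurity drop when `parent` is split into `left` and `right`, each (pos, neg)."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(*child) for child in (left, right))
    return gini(*parent) - weighted


def summarise(splits):
    """Map descriptor -> (mean impurity decrease, number of nodes using it)."""
    drops = defaultdict(list)
    for name, parent, left, right in splits:
        drops[name].append(impurity_decrease(parent, left, right))
    return {name: (sum(v) / len(v), len(v)) for name, v in drops.items()}


# Hypothetical split records, as they might be collected from the trees of a forest.
splits = [
    ("nAcid", (10, 10), (10, 0), (0, 10)),  # perfect split
    ("nAcid", (6, 6), (3, 3), (3, 3)),      # uninformative split
    ("AMW", (8, 8), (6, 2), (2, 6)),
]
for name, (score, nodes) in summarise(splits).items():
    print(f"{name}: mean decrease {score:.3f} over {nodes} node(s)")
```

Read this way, a descriptor can have a high score with few nodes (rarely used but decisive) or a modest score with many nodes (used often with small gains), which is why the tables report both columns.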
Fig. 1

Box and whisker plots depicting the percentage of PHARM validation dataset samples retained within the applicability domain of the processed Ames (A), GSX (B), SHE (C), ISC (D) modelling datasets over different k values. k values in the variable knn-based distance measure were optimized using 20% of samples in the PHARM validation dataset, 1000 iterations, and a maximum k value of 25.

Table 1

Summary of initial datasets after curation.

| Name/source | Bioassay/data source | n | +/− Results | % Balance (+/−) |
|---|---|---|---|---|
| Ames Mutagenicity | Ames Bacterial Mutagenicity Benchmark Dataset | 6512 | 3503/3009 | 54:46 |
| Syrian Hamster Embryonic | Syrian Hamster Embryonic Cell Transformation Assay (pH 7) | 356 | 232/124 | 65:35 |
| GreenScreen | GADDα-45 GFP Literature Results | 1415 | 163/1252 | 12:88 |
| ISSCAN | ISSCAN expert rat carcinogenicity calls | 854 | 510/344 | 60:40 |
| PHARM | Rat carcinogenicity literature results | 374 | 134/240 | 36:64 |

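
The n and balance columns of Table 1 follow directly from the positive and negative counts. The short check below recomputes them; the counts are copied from the table, while the function name and integer rounding are choices made for this sketch.

```python
# Positive/negative result counts as reported in Table 1.
counts = {
    "Ames Mutagenicity": (3503, 3009),
    "Syrian Hamster Embryonic": (232, 124),
    "GreenScreen": (163, 1252),
    "ISSCAN": (510, 344),
    "PHARM": (134, 240),
}


def balance(pos, neg):
    """Return (n, % positive, % negative) with percentages rounded to integers."""
    n = pos + neg
    pct_pos = round(100 * pos / n)
    return n, pct_pos, 100 - pct_pos


for name, (pos, neg) in counts.items():
    n, p, q = balance(pos, neg)
    print(f"{name}: n={n}, balance {p}:{q}")
```

The GreenScreen set stands out as strongly imbalanced (about one positive for every eight negatives), which matters when choosing performance metrics for models trained on it.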
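Table 2

Summary of modelling datasets after structural descriptor calculation and selection.

| Name | # Desc. | n/Desc |
|---|---|---|
| Ames | 90 | 72.4 |
| GSX | 21 | 93.1 |
| SHE | 58 | 8.28 |
| ISC | 51 | 23.5 |
| PHARM* | 18 | 75 |

* The source gives the PHARM values as the fused digits 1875; the split shown is one possible reading (187 and 5 is another).
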
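Table 3

Applicability Domain (AD) quantification comparing each processed QSAR modelling dataset to the PHARM dataset.

| Name | AD measure | Inside AD | Outside AD |
|---|---|---|---|
| Ames | knn distance | 350 | 24 |
| Ames | Distance from centroid | 356 | 18 |
| Ames | Leverage | 351 | 23 |
| GSX | knn distance | 355 | 19 |
| GSX | Distance from centroid | 353 | 21 |
| GSX | Leverage | 352 | 22 |
| SHE | knn distance | 357 | 17 |
| SHE | Distance from centroid | 356 | 18 |
| SHE | Leverage | 331 | 43 |
| ISC | knn distance | 325 | 49 |
| ISC | Distance from centroid | 330 | 44 |
| ISC | Leverage | 309 | 65 |
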
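
The counts in Table 3 are easier to compare as percentages of the 374 PHARM compounds. The sketch below converts them; the counts are copied from the table and the `coverage` helper is a name introduced for this example.

```python
# (inside AD, outside AD) counts copied from Table 3.
ad_counts = {
    ("Ames", "knn distance"): (350, 24),
    ("Ames", "Distance from centroid"): (356, 18),
    ("Ames", "Leverage"): (351, 23),
    ("GSX", "knn distance"): (355, 19),
    ("GSX", "Distance from centroid"): (353, 21),
    ("GSX", "Leverage"): (352, 22),
    ("SHE", "knn distance"): (357, 17),
    ("SHE", "Distance from centroid"): (356, 18),
    ("SHE", "Leverage"): (331, 43),
    ("ISC", "knn distance"): (325, 49),
    ("ISC", "Distance from centroid"): (330, 44),
    ("ISC", "Leverage"): (309, 65),
}


def coverage(inside, outside):
    """Percentage of samples that fall inside the applicability domain."""
    return 100 * inside / (inside + outside)


for (dataset, measure), (inside, outside) in ad_counts.items():
    print(f"{dataset:5s} {measure:23s} {coverage(inside, outside):5.1f}% inside")
```

Every row sums to 374, the size of the PHARM dataset in Table 1. The lowest coverage is ISC with leverage and the highest is SHE with the knn distance.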
Table 4

Attribute Importance Scores (AIS) and node membership for structural descriptors in the GreenScreen (GSX) QSAR model.

| Structural descriptor | AIS | Node membership |
|---|---|---|
| AATS2m | 0.37 | 20,529 |
| naAromAtom | 0.35 | 15,136 |
| GATS2m | 0.35 | 17,920 |
| GATS2i | 0.35 | 16,693 |
| SpMin1_Bhp | 0.34 | 15,952 |
| ETA_BetaP | 0.32 | 14,229 |
| IC3 | 0.3 | 14,615 |
| nAcid | 0.3 | 4138 |
| nAtomLC | 0.29 | 11,238 |
| nssO | 0.29 | 9829 |
| MDEC-33 | 0.29 | 12,339 |
| SHBint6 | 0.29 | 7607 |
| R_TpiPCTPC | 0.28 | 12,038 |
| RDF50m | 0.28 | 7741 |
| MPC4 | 0.28 | 13,116 |
| L3m | 0.27 | 7252 |
| De | 0.26 | 7117 |
| nHAvin | 0.23 | 3773 |
| MDEN-13 | 0.21 | 4086 |
| n3HeteroRing | 0.18 | 1316 |

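Table 5

Attribute Importance Scores (AIS) and node membership for structural descriptors in the ISSCAN (ISC) QSAR model.

| Structural descriptor | AIS | Node membership |
|---|---|---|
| ALogp2 | 0.4 | 13,058 |
| AATSC0p | 0.38 | 11,714 |
| ATSC7e | 0.38 | 9518 |
| MATS3s | 0.37 | 11,996 |
| BCUTp-1l | 0.37 | 10,779 |
| MATS1e | 0.37 | 11,671 |
| BCUTw-1h | 0.35 | 9596 |
| nN | 0.35 | 6494 |
| nAcid | 0.34 | 1640 |
| nCl | 0.34 | 3703 |
| P1s | 0.33 | 7280 |
| C3SP2 | 0.3 | 4817 |
| nHBint4 | 0.3 | 2765 |
| nHBint2 | 0.29 | 3028 |
| nssCH2 | 0.29 | 5294 |
| topoShape | 0.29 | 5618 |
| nHBint3 | 0.29 | 3033 |
| nHBint5 | 0.28 | 2579 |
| nHBint6 | 0.28 | 1998 |
| C1SP2 | 0.28 | 4941 |
| nHCsatu | 0.27 | 3117 |
| nAtomLAC | 0.26 | 3733 |
| ndO | 0.26 | 4003 |
| minsOH | 0.26 | 3633 |
| maxsOH | 0.26 | 3248 |
| nssO | 0.26 | 2945 |
| nsssCH | 0.25 | 3217 |
| maxsNH2 | 0.25 | 3572 |
| maxssO | 0.25 | 3885 |
| nssssC | 0.25 | 1917 |
| ndssC | 0.25 | 3828 |
| nBase | 0.25 | 1811 |
| ETA_Beta_ns_d | 0.25 | 3798 |
| SaaS | 0.24 | 612 |
| nHssNH | 0.24 | 1971 |
| n6Ring | 0.24 | 4471 |
| n6HeteroRing | 0.23 | 1761 |
| mindsN | 0.23 | 1969 |
| nHBDon | 0.23 | 3556 |
| nsssN | 0.23 | 2034 |
| LipinskiFailures | 0.23 | 1479 |
| nsOm | 0.23 | 1527 |
| maxdsN | 0.22 | 1748 |
| n5HeteroRing | 0.21 | 1658 |
| MDEN-23 | 0.2 | 2373 |
| SRW5 | 0.19 | 2445 |
| mintsC | 0.19 | 641 |
| SaaO | 0.18 | 789 |
| MDEN-13 | 0.18 | 706 |
| nT7Ring | 0.15 | 574 |
| nF11HeteroRing | 0.13 | 393 |
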
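Table 6

Attribute Importance Scores (AIS) and node membership for structural descriptors in the Syrian Hamster Embryonic (SHE) cell transformation assay QSAR model.

| Structural descriptor | AIS | Node membership |
|---|---|---|
| ATSC0e | 0.39 | 2283 |
| AATSC4m | 0.39 | 2176 |
| AATS0p | 0.38 | 2522 |
| AATS7i | 0.38 | 1848 |
| MATS2c | 0.37 | 2105 |
| AATS1m | 0.36 | 2582 |
| ATSC6s | 0.36 | 2284 |
| ATSC8s | 0.36 | 1687 |
| GATS2c | 0.35 | 1825 |
| AATSC7m | 0.34 | 1812 |
| MATS7m | 0.34 | 1589 |
| MATS8c | 0.33 | 1571 |
| MATS7p | 0.33 | 1573 |
| GATS1c | 0.33 | 2219 |
| GATS4m | 0.33 | 1943 |
| VE3_Dzp | 0.32 | 2041 |
| SpMin8_Bhp | 0.32 | 1655 |
| nAcid | 0.31 | 448 |
| GATS8v | 0.31 | 1067 |
| GATS8c | 0.31 | 1464 |
| SpMin7_Bhs* | 0.31 | 857 |
| VR2_Dt* | 0.31 | 707 |
| GATS8i | 0.29 | 1172 |
| naasC | 0.29 | 808 |
| TDB2i | 0.29 | 1128 |
| C1SP2 | 0.29 | 760 |
| nsCH3 | 0.29 | 974 |
| TDB8u | 0.28 | 672 |
| minHBint3 | 0.28 | 803 |
| C1SP3 | 0.28 | 992 |
| TDB4e | 0.27 | 1057 |
| ASP-6 | 0.27 | 1806 |
| nRotBt | 0.27 | 1127 |
| SCH-5 | 0.27 | 594 |
| Ds | 0.27 | 972 |
| SHsOH | 0.27 | 897 |
| RNCG | 0.27 | 1039 |
| E3p | 0.27 | 915 |
| RDF10m | 0.27 | 961 |
| TDB1m | 0.27 | 1203 |
| piPC9 | 0.27 | 1033 |
| TDB3i | 0.26 | 1314 |
| nsNH2 | 0.26 | 400 |
| nHsNH2 | 0.26 | 496 |
| RotBFrac | 0.26 | 1203 |
| nssO | 0.25 | 565 |
| minsssN | 0.25 | 376 |
| nHeteroRing | 0.25 | 490 |
| TDB7r | 0.24 | 835 |
| maxssO | 0.23 | 743 |
| SRW5 | 0.23 | 442 |
| minHBint7 | 0.23 | 424 |
| nssssNp | 0.23 | 25 |
| nAtomLAC | 0.22 | 871 |
| MDEN-12* | 0.21 | 208 |
| nT6HeteroRing* | 0.21 | 384 |
| nHCHnX* | 0.21 | 72 |
| nFG12HeteroRing | 0.17 | 258 |

* In the source the AIS and node membership digits for these rows run together (0.31857, 0.31707, 0.21208, 0.21384, 0.2172). The split shown assumes a two-decimal AIS; a one-decimal AIS (0.3 or 0.2) with one more leading digit in the node membership is equally consistent with the descending sort order.
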
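Table 7

Attribute Importance Scores (AIS) and node membership for structural descriptors in the Ames QSAR model.

| Structural descriptor | AIS | Node membership |
|---|---|---|
| AATS4e | 0.42 | 17,373 |
| ATSC7c | 0.42 | 15,115 |
| ATSC2c | 0.41 | 18,428 |
| AATS2e | 0.41 | 16,488 |
| AATS1m | 0.41 | 16,634 |
| ATSC4m | 0.4 | 16,224 |
| ATSC3m | 0.4 | 16,652 |
| ATSC2v | 0.38 | 15,374 |
| ATSC3e | 0.38 | 15,359 |
| ATSC4i | 0.37 | 14,988 |
| ATSC2e | 0.37 | 15,187 |
| AATSC5c | 0.37 | 14,636 |
| ATSC2i | 0.36 | 14,319 |
| MATS6c | 0.36 | 13,289 |
| nAcid | 0.36 | 2038 |
| MATS4c | 0.36 | 15,264 |
| ATSC1e | 0.35 | 15,795 |
| MATS4m | 0.35 | 13,467 |
| ATSC1i | 0.35 | 14,449 |
| MATS6i | 0.34 | 12,963 |
| GATS3c | 0.34 | 13,301 |
| GATS1c | 0.33 | 14,472 |
| GATS8m | 0.33 | 9386 |
| AATSC1i | 0.32 | 13,724 |
| AATSC1e | 0.32 | 13,859 |
| GATS5v | 0.32 | 12,664 |
| VE1_Dzp | 0.32 | 11,795 |
| GATS3m | 0.32 | 12,529 |
| MATS1e | 0.31 | 12,967 |
| MATS1i | 0.3 | 12,082 |
| BCUTc-1l | 0.3 | 9751 |
| GATS1m | 0.3 | 12,484 |
| SpMax1_Bhv | 0.29 | 13,212 |
| GATS2e | 0.29 | 12,634 |
| SpMin1_Bhp | 0.29 | 13,295 |
| GATS1p | 0.29 | 11,559 |
| GATS1i | 0.29 | 11,251 |
| SpMax1_Bhi | 0.28 | 12,842 |
| ASP-2 | 0.28 | 8374 |
| ETA_EtaP | 0.28 | 11,891 |
| mindsssP | 0.27 | 299 |
| BCUTw-1h | 0.27 | 8210 |
| BCUTw-1l | 0.27 | 5500 |
| hmax | 0.27 | 12,581 |
| BIC2 | 0.27 | 11,166 |
| TDB9u | 0.27 | 4470 |
| Mpe | 0.27 | 9879 |
| Du | 0.27 | 6788 |
| ETA_Epsilon_1 | 0.27 | 9185 |
| MIC2 | 0.26 | 10,461 |
| nHCsats | 0.26 | 4146 |
| MDEC-12 | 0.26 | 5914 |
| ETA_Eta_L | 0.26 | 11,100 |
| E3i | 0.26 | 6433 |
| JGT | 0.26 | 10,771 |
| TDB8i | 0.26 | 5577 |
| RDF40m | 0.26 | 7058 |
| ETA_Epsilon_4 | 0.25 | 9370 |
| SHCsatu | 0.25 | 4686 |
| BIC1 | 0.25 | 11,006 |
| MLFER_S | 0.25 | 10,748 |
| R_TpiPCTPC | 0.25 | 11,100 |
| ETA_dEpsilon_A | 0.25 | 8641 |
| RDF20m | 0.25 | 7793 |
| WTPT-5 | 0.25 | 9438 |
| C3SP3 | 0.25 | 1330 |
| piPC10 | 0.25 | 7381 |
| SaaaC | 0.25 | 4592 |
| RDF20s | 0.25 | 7448 |
| L3u | 0.24 | 7228 |
| SdCH2 | 0.24 | 705 |
| ETA_BetaP_ns_d | 0.24 | 5993 |
| AMW | 0.24 | 8526 |
| MLFER_A | 0.23 | 7826 |
| nAtomP | 0.23 | 7347 |
| minHssNH | 0.22 | 3200 |
| ETA_Shape_X | 0.22 | 2980 |
| MDEN-33 | 0.21 | 990 |
| mindsN | 0.21 | 2570 |
| minsNH2 | 0.21 | 3578 |
| SRW9 | 0.2 | 4461 |
| MDEN-23 | 0.2 | 2559 |
| maxHCHnX | 0.2 | 2082 |
| minwHBd | 0.19 | 1599 |
| nTG12Ring | 0.19 | 1527 |
| nFG12Ring | 0.19 | 1633 |
| SaaS | 0.18 | 760 |
| MDEN-13 | 0.17 | 979 |
| SCH-3 | 0.15 | 1699 |
| VCH-3 | 0.14 | 1762 |
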
Specifications Table

Subject area: Computational Chemistry

More specific subject area: Quantitative Structure-Activity Relationship (QSAR) modelling

Type of data: Raw data (CSV files), processed data (ARFF files) with analysis

Data format: SMILES structures and bioassay results, selected descriptors

Experimental factors: Data was gleaned for QSAR model development

Experimental features: QSAR models were developed for each dataset using machine learning algorithms and calculated structural descriptors

Data source location: Discipline of Pharmacology, Blackburn Building, University of Sydney, Australia

Data accessibility: Raw and processed data are presented as CSV and ARFF files, respectively, as supplementary data for this article

References (10 of 14 listed)

Ronald D Snyder. An update on the genotoxicity and carcinogenicity of marketed pharmaceuticals with reference to in silico predictivity. Environ Mol Mutagen. 2009-07.

Andrew W Knight; Louise Birrell; Richard M Walmsley. Development and validation of a higher throughput screening approach to genotoxicity testing using the GADD45a-GFP GreenScreen HC assay. J Biomol Screen. 2009-01.

Louise Birrell; Paul Cahill; Chris Hughes; Matthew Tate; Richard M Walmsley. GADD45a-GFP GreenScreen HC assay results for the ECVAM recommended lists of genotoxic and non-genotoxic chemicals for assessment of new genotoxicity tests. Mutat Res. 2009-12-16.

Christopher Jagger; Matthew Tate; Paul A Cahill; Chris Hughes; Andrew W Knight; Nicholas Billinton; Richard M Walmsley. Assessment of the genotoxicity of S9-generated metabolites using the GreenScreen HC GADD45a-GFP assay. Mutagenesis. 2008-09-11.

Andrew Olaharski; Silvio Albertini; Stephan Kirchner; Stefan Platz; Hirdesh Uppal; Henry Lin; Kyle Kolaja. Evaluation of the GreenScreen GADD45alpha-GFP indicator assay with non-proprietary and proprietary compounds. Mutat Res. 2008-09-04.

Katja Hansen; Sebastian Mika; Timon Schroeter; Andreas Sutter; Antonius ter Laak; Thomas Steger-Hartmann; Nikolaus Heinrich; Klaus-Robert Müller. Benchmark data set for in silico prediction of Ames mutagenicity. J Chem Inf Model. 2009-09.

Chun Wei Yap. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2010-12-17.

Donna Johnson; Richard Walmsley. Histone-deacetylase inhibitors produce positive results in the GADD45a-GFP GreenScreen HC assay. Mutat Res. 2013-01-20.

R J Isfort; G A Kerckaert; R A LeBoeuf. Comparison of the standard and reduced pH Syrian hamster embryo (SHE) cell in vitro transformation assays in predicting the carcinogenic potential of chemicals. Mutat Res. 1996-09-21.

Faizan Sahigara; Davide Ballabio; Roberto Todeschini; Viviana Consonni. Defining a novel k-nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions. J Cheminform. 2013-05-30.