
CORAL: Building up QSAR models for the chromosome aberration test.

Andrey A Toropov¹, Alla P Toropova¹, Giuseppa Raitano¹, Emilio Benfenati¹.

Abstract

A high level of chromosomal aberrations in peripheral blood lymphocytes may be an early marker of cancer risk, but data on the risk of specific cancers and on types of chromosomal aberrations are limited. Consequently, the development of predictive models for the chromosomal aberration test is an important task. The majority of models for this test are so-called knowledge-based rule systems. The CORAL software (http://www.insilico.eu/coral, an abbreviation of "CORrelation And Logic") is an alternative to knowledge-based rule systems. In contrast to such systems, the CORAL software makes it possible to estimate how the predictive potential of a model is influenced by different molecular alerts and by different splits into the training set and validation set. This possibility is not available in approaches based on knowledge-based rule systems. Quantitative structure-activity relationships (QSAR) for the chromosome aberration test are established for five random splits into the training, calibration, and validation sets. The QSAR approach is based on representation of the molecular structure by the simplified molecular input-line entry system (SMILES), without data on physicochemical and/or biochemical parameters. In spite of this limitation, the statistical quality of these models is quite good (Matthews correlation coefficient above 0.6 for the calibration and validation sets of all splits).


Keywords: CORAL software; Chromosome aberration; Monte Carlo method; QSAR; SMILES

Year: 2018 PMID: 31516335 PMCID: PMC6734133 DOI: 10.1016/j.sjbs.2018.05.013

Source DB: PubMed Journal: Saudi J Biol Sci ISSN: 1319-562X Impact factor: 4.219

Introduction

There is a large diversity of biochemical endpoints that should be available, at least via computational models, for the development of medicinal biochemistry (Tenorio-Borroto et al., 2014, González-Díaz et al., 2013a, González-Díaz et al., 2013b, Prado-Prado et al., 2013, Duardo-Sanchez and Gonzalez-Diaz, 2013, Tenorio-Borroto et al., 2012, Riera-Fernández et al., 2012, González-Díaz et al., 2007). Mutagenicity and carcinogenicity are interrelated factors that can catastrophically impact human health (Toropova and Toropov, 2014). Assessing the risk of using various substances in this respect is therefore vital (Gollapudi et al., 2013). According to PubMed, the number of publications dedicated to the chromosome aberration assay increased over 2012–2017. Many authors note the importance of systematizing the available data and defining an effective strategy for the diagnosis and treatment of different cases of breast cancer accompanied by chromosome aberrations (Grade et al., 2015, Ben-David et al., 2016, Hosein et al., 2010, Rennstam et al., 2003, Watters et al., 2003, Vulto-van Silfhout et al., 2013, Brookmire et al., 2013, Sun et al., 2015, Castro et al., 2006, Boffetta et al., 2007). The European REACH legislation (Registration, Evaluation, Authorization and Restriction of Chemicals) aims to improve the safety of chemical substances, increase research efforts, and promote scientific innovation, including the use of alternative approaches to evaluate substances (REACH, 2006). Among the in vitro tests required to identify mutagenic compounds, the bacterial reverse mutation assay (Ames test) (Ames, 1979) and the chromosome aberration test are frequently used in the first stages of the assessment of mutagenicity.
It should be noted that, despite its strong influence, REACH has also had negative consequences: (1) the registration process is very expensive, owing to the large amount of experimental and administrative work required; and (2) at the social level, REACH raises an ethical problem because of the huge amount of animal testing needed to meet its requirements (Gozalbes and Vicente de Julián-Ortiz, 2018). More than 25 years ago, the OECD recognized the need to protect animals in general and, in particular, those used in experimental work. The progress in the OECD on the harmonization of chemicals control, especially the agreement on Mutual Acceptance of Data (MAD), has greatly contributed to reducing the number of animals used in testing by avoiding duplicative testing. All OECD Test Guidelines (TGs) are available at the OECD website (http://www.oecd.org/env/ehs/testing/oecdguidelinesforthetestingofchemicals.htm). In the Ames test, the genotoxic potential of a target compound is determined by detecting the restored capability of an auxotrophic, histidine-dependent strain of S. typhimurium to synthesize this essential amino acid. In the presence of a mutagen, the revertant bacteria can grow on a medium without histidine (OECD, 2008a). The in vitro chromosome aberration assay is used to identify agents that cause structural aberrations in mammalian cells. As in the Ames test, the target compounds are examined with and without a metabolizing system, since the interaction with genetic material often occurs after metabolic activation. After incubation with the test chemical at intended intervals, the cells are arrested in metaphase and analyzed microscopically for chromosomal aberrations. Many human genetic diseases are caused by chromosome mutations, and there is evidence that they are also involved in the alterations of oncogenes and tumor suppressor genes of somatic cells in humans and experimental animals (OECD, 2008b).
Chromosomal aberrations in peripheral blood lymphocytes have been used for decades for the surveillance of healthy individuals exposed to known or potential mutagens and carcinogens (Boffetta et al., 2007, Carrano and Natarajan, 1988). In addition, chromosome aberrations are typical features of neoplastic cells, and for certain cancers specific chromosome abnormalities are commonly present (Yunis, 1983). Although specific chromosome aberrations detected in neoplasms are generated during carcinogenesis, it has been hypothesized that the frequency of chromosomal aberrations represents a marker of susceptibility to cancer, based on the concept that genetic damage in peripheral blood lymphocytes reflects similar damage in different target cells undergoing carcinogenesis (Carrano and Natarajan, 1988, Umbuzeiro et al., 2016). Moreover, the result of the chromosome aberration test is an important characteristic of a substance from the point of view of drug discovery (Nigam, 2009), cosmetics, and the food industry (https://www.fda.gov/downloads/Drugs/Guidances/ucm074931.pdf). Because Ames test data are publicly available and of high quality, they have been used to develop several QSAR models that have shown good performance in predicting mutagenic activity in recent years (Claxton et al., 2010). For the chromosomal aberration endpoint, predictive models are few. This is probably due to the complexity of the mechanism of its induction and the lower availability of high-quality experimental data. There are several models of the chromosome aberration test that combine topological indices with physicochemical and biochemical parameters (Votano, 2005, Jacobson-Kram and Contrera, 2007, Serra et al., 2003, Mohr et al., 2010, Rosenkranz, 2004, Rothfuss et al., 2006, Estrada and Molina, 2006). However, data on physicochemical and biochemical parameters are often unavailable.
Consequently, using molecular structures alone, without additional data, is an attractive alternative for building a model of the chromosome aberration test. The CORAL (CORrelation And Logic) software allows models of this kind to be built. The aim of this study is to evaluate models for the chromosome aberration test built using the CORAL software (Toropova and Toropov, 2014).

Method

Data

Experimental data for this work were taken from the Genotoxicity OASIS Database (http://oasis-lmc.org/products/databases/rat-liver-metabolism-extended.aspx) and the Toxicity Japan MHLW database (http://dra4.nihs.go.jp/mhlw_data/jsp/SearchPageENG.jsp), which include data for chromosomal aberrations determined by in vitro tests using Chinese hamster lung (CHL) and ovary (CHO) cells, with and without metabolic activation (metabolic system S9). After removing duplicates, we collected a set of 477 organic compounds: 223 are classified as active and 254 as inactive in the chromosomal aberration test. For each compound, the CAS number, the simplified molecular input-line entry system (SMILES) representation, and the experimental result expressed as active (+1) or inactive (−1) are recorded. Finally, the SMILES were normalized with the VEGA platform (www.vega-qsar.eu/). The compounds were randomly split into training (about 80%), calibration (about 10%), and validation (about 10%) sets; five such splits are examined. The CORAL software is developed around the following hypothesis: a QSAR model is a random event (Toropov et al., 2013). In other words, the same approach to building a QSAR model gives quite different models for different splits into the training set and validation set. Thus, lucky splits (good statistical quality) and unlucky splits (poor statistical quality) occur for any total set used for QSAR analysis. Consequently, to test an approach reliably, one should examine a group of different distributions of the available data into the training set (visible while the model is built) and the validation set (invisible while the model is built). Such an experiment confirms that there are lucky and unlucky splits, especially if a large number of different splits is examined.
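The repeated random splitting described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `random_split`, the use of a seeded `random.Random`, and the exact 80/10/10 rounding are assumptions. The set sizes reported in Table 2 vary from split to split, so the original splitting evidently differs in detail.

```python
import random

def random_split(n_compounds, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return index lists for the training, calibration, and validation sets."""
    indices = list(range(n_compounds))
    # A seeded generator makes each split reproducible (assumption, not from the source).
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_compounds)
    n_calib = round(fractions[1] * n_compounds)
    train = indices[:n_train]
    calib = indices[n_train:n_train + n_calib]
    valid = indices[n_train + n_calib:]  # whatever remains goes to validation
    return train, calib, valid

# Five independent splits of the 477 compounds; each seed gives a different split.
splits = [random_split(477, seed=s) for s in range(5)]
for number, (train, calib, valid) in enumerate(splits, start=1):
    print(number, len(train), len(calib), len(valid))
```

Building and scoring a model on each of the five splits, rather than on one, is what exposes the spread between "lucky" and "unlucky" splits.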

Optimal descriptor

The optimal descriptor used in this work is calculated from SMILES attributes. The simplified molecular input-line entry system (SMILES) (Weininger, 1988) is used to represent the molecular structure via these attributes. In this work, two local SMILES attributes (S and SS) and one global SMILES attribute (HARD) are used to build predictive models. The S attributes are SMILES atoms, i.e. one symbol from SMILES or two symbols that cannot be examined separately, e.g. ‘Cl’, ‘Br’, etc. The SS attributes are combinations of two SMILES atoms. HARD is a global SMILES attribute that reflects the presence or absence of nitrogen, oxygen, sulphur, phosphorus, chlorine, fluorine, bromine, iodine, and double and triple covalent bonds (Toropov et al., 2013, Toropova et al., 2011, Toropov et al., 2012a). Table 1 contains an example of the definition of S, SS, and HARD. T is the threshold, i.e. an integer used to divide all SMILES attributes into two classes: (i) rare, i.e. the number of occurrences of the attribute in the training set is less than the threshold; and (ii) not rare, i.e. the number of occurrences of the attribute in the training set is larger than or equal to the threshold. N is the number of epochs of the Monte Carlo optimization of the target function (Toropov et al., 2013). T = T∗ and N = N∗ are the values of these parameters that give the best statistics for the calibration set. So-called semi-correlation (Toropov et al., 2012b, Toropova and Toropov, 2017) has been used to build predictive models for the chromosomal aberration test. Fig. 1 illustrates the relationship between semi-correlation and a binary classification model. Fig. 2 contains an example of a model for the chromosome aberration test.
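A minimal sketch of how the S, SS, and HARD attributes might be extracted from a SMILES string is given below. Several details are assumptions rather than the CORAL implementation: bracket expressions such as `[N+]` are treated as one SMILES atom, HARD is encoded as a string of ten 0/1 flags in the order listed in the text, an attribute counts as "not rare" when it occurs in at least T training SMILES, and the descriptor is taken as a plain sum of correlation weights with rare attributes given zero weight. All function names and the toy weights are hypothetical.

```python
import re
from collections import Counter

# Assumed tokenisation: a bracket expression, a two-letter halogen, or one symbol.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|.")
ELEMENT = re.compile(r"Cl|Br|[A-Z][a-z]?|[a-z]")
# Order of the presence/absence flags in the assumed HARD string.
HARD_ORDER = ("N", "O", "S", "P", "Cl", "F", "Br", "I", "=", "#")

def smiles_atoms(smiles):
    """S attributes: the sequence of SMILES atoms."""
    return TOKEN.findall(smiles)

def smiles_pairs(atoms):
    """SS attributes: pairs of neighbouring SMILES atoms."""
    return list(zip(atoms, atoms[1:]))

def hard_flags(atoms):
    """HARD attribute as a string of 0/1 flags (layout is an assumption)."""
    present = set()
    for atom in atoms:
        if atom in ("=", "#"):
            present.add(atom)
            continue
        match = ELEMENT.search(atom)
        if match:
            present.add(match.group().capitalize())  # aromatic 'n' counts as N
    return "".join("1" if key in present else "0" for key in HARD_ORDER)

def frequent_attributes(training_smiles, threshold):
    """Attributes found in at least `threshold` training SMILES (not rare)."""
    counts = Counter()
    for smiles in training_smiles:
        atoms = smiles_atoms(smiles)
        counts.update(set(atoms) | set(smiles_pairs(atoms)) | {hard_flags(atoms)})
    return {attr for attr, n in counts.items() if n >= threshold}

def dcw(smiles, weights):
    """Sum of correlation weights; attributes without a weight contribute zero."""
    atoms = smiles_atoms(smiles)
    attributes = atoms + smiles_pairs(atoms) + [hard_flags(atoms)]
    return sum(weights.get(attr, 0.0) for attr in attributes)

example = "O=[N+]([O-])c1ccc(cc1)Cl"      # SMILES used in Table 1
atoms = smiles_atoms(example)
print(atoms)
print(hard_flags(atoms))
toy_weights = {"Cl": 1.5, ("c", "c"): 0.25}  # hypothetical correlation weights
print(dcw(example, toy_weights))
```

In CORAL the correlation weights are obtained by Monte Carlo optimization; here they are supplied by hand only to show how a descriptor value is assembled from the attributes.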

Table 1

Examples of the S_k, SS_k, and HARD for the molecular structure represented by the SMILES O=[N+]([O-])c1ccc(cc1)Cl.

Fig. 1

Interpretations for traditional correlation and semi-correlation.

Fig. 2

Graphical representation of semi-correlations for split 2 (“lucky split”) and statistical characteristics of this model for the chromosome aberration test. TP = true positive; TN = true negative; FP = false positive; and FN = false negative.


Statistical criteria

To build a classification model, i.e. a separation of two classes, (i) active (1) and (ii) inactive (−1) (Toropova and Toropov, 2017), the following statistical criteria have been used: sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC):

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

In these equations TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively, in a confusion matrix. The MCC is used in machine learning as a balanced measure of the quality of binary classifications, and it is useful even if the classes are of very different sizes (Dao et al., 2011). A model is good if MCC → 1 (in practice, the MCC should be larger than 0.6).
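The four criteria can be computed from a confusion matrix as in the sketch below, which uses their standard definitions. The confusion counts in the example are not taken from the source: they are illustrative values for a 35-compound set, chosen because they are consistent with the sensitivity, specificity, accuracy, and MCC reported in Table 2 for the validation set of split 2.

```python
from math import sqrt

def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy, and MCC for a confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc

# Illustrative counts (inferred, not reported in the source).
sens, spec, acc, mcc = classification_metrics(tp=14, tn=19, fp=0, fn=2)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} "
      f"accuracy={acc:.4f} MCC={mcc:.4f}")
```

With only 35 compounds in a set, reclassifying a single compound shifts the accuracy by almost three percentage points, which helps explain the spread between splits.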

Domain of applicability

The domain of applicability is an important component of a QSAR analysis. The diversity of QSAR approaches leads to a diversity of conceptions of the domain of applicability. A collection of such conceptions is available in the literature (Gadaleta et al., 2016): (i) chemical-physical domain; (ii) structural domain; (iii) response domain; and (iv) integrated methods. In the case of the CORAL models, however, the domain of applicability is defined on the basis of the statistical defects of SMILES, calculated according to the distribution of the available data into the training, invisible training, calibration, and validation sets. The defect of a SMILES attribute is defined via the difference between the probability of the attribute in the training set and its probability in the calibration set. The SMILES-defect is the sum of these defects over the attributes of a SMILES. If a SMILES is characterized by a SMILES-defect lower than twice the average defect over the compounds of the training set, the SMILES falls into the domain of applicability; otherwise the SMILES is outside the domain of applicability (Toropova and Toropov, 2017). Here P(A) and P′(A) are the probabilities of attribute A in the training and calibration sets, respectively; N(A) and N′(A) are the frequencies of A in the training and calibration sets, respectively; NA is the number of attributes in a SMILES; and the average SMILES-defect is taken over the training set.
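The rule above can be sketched in code under stated assumptions. The text defines the attribute defect only "via the difference" of P(A) and P′(A); the normalization by N(A) + N′(A), the value 1 assigned to attributes absent from the training set, and the counting of each attribute once per SMILES are assumptions made here for illustration. The function names and the toy attribute lists are hypothetical.

```python
from collections import Counter
from statistics import mean

def attribute_counts(attribute_lists):
    """Count in how many SMILES of a set each attribute occurs."""
    counts = Counter()
    for attributes in attribute_lists:
        counts.update(set(attributes))
    return counts

def attribute_defect(attr, n_train, n_calib, size_train, size_calib):
    """Defect of one attribute from its training and calibration frequencies."""
    n, n_prime = n_train[attr], n_calib[attr]
    if n == 0:
        return 1.0  # assumption: attributes unseen in training get maximal defect
    p, p_prime = n / size_train, n_prime / size_calib
    return abs(p - p_prime) / (n + n_prime)  # assumed normalization

def smiles_defect(attributes, n_train, n_calib, size_train, size_calib):
    """Sum of attribute defects over all attributes of one SMILES."""
    return sum(attribute_defect(a, n_train, n_calib, size_train, size_calib)
               for a in attributes)

# Toy attribute lists standing in for tokenized SMILES.
train = [["C", "O"], ["C", "N"], ["C", "C", "O"], ["C", "Cl"]]
calib = [["C", "O"], ["C", "N"]]
n_train, n_calib = attribute_counts(train), attribute_counts(calib)
sizes = (len(train), len(calib))

training_defects = [smiles_defect(s, n_train, n_calib, *sizes) for s in train]
limit = 2 * mean(training_defects)  # twice the average training-set defect

def in_domain(attributes):
    """True if the SMILES-defect is below twice the training-set average."""
    return smiles_defect(attributes, n_train, n_calib, *sizes) < limit

print(in_domain(["C", "N"]), in_domain(["C", "Br"]))
```

Under these assumptions a compound containing an attribute never seen in training (here "Br") falls outside the domain, which matches the intent that predictions are trusted only for structures resembling the training data.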

Results and discussion

Table 2 contains the statistical characteristics of the models of the chromosome aberration test built with the CORAL software. Table 3 contains the statistical characteristics of models suggested in the literature. One can see that the CORAL models are satisfactory and comparable with the analogous models from the literature. For the training set, sensitivity is in the range 0.67–0.76. Better results are always obtained for specificity, with values reaching 0.83. The values for accuracy are, of course, between those of sensitivity and specificity, within a very narrow range between 0.75 and 0.78. Indeed, split 5, which has the lowest sensitivity, has the highest specificity, while split 1, with the highest sensitivity, has a relatively low specificity. As often happens with CORAL, the highest statistical parameters are obtained for the calibration set. Better results for sensitivity are observed on the validation set, with values in the range between 0.81 and 1.0.

Table 2

The statistical quality of models for chromosome aberration test.

Split	Set	n	Sensitivity	Specificity	Accuracy	MCC
1	Training	399	0.7592	0.7981	0.7794	0.5578
	Calibration	39	0.8333	0.8667	0.8462	0.6868
	Validation	39	0.8750	0.8387	0.8462	0.6244

2	Training	407	0.7016	0.8009	0.7543	0.5059
	Calibration	35	0.9375	0.9471	0.9429	0.8849
	Validation	35	0.8750	1.000	0.9429	0.8898

3	Training	380	0.7348	0.7889	0.7632	0.5248
	Calibration	49	0.9333	0.8235	0.8571	0.7097
	Validation	48	0.8148	1.000	0.8958	0.8112

4	Training	398	0.7513	0.7707	0.7613	0.5221
	Calibration	40	0.9412	0.9565	0.9500	0.8977
	Validation	39	1.000	0.6923	0.7949	0.6574

5	Training	399	0.6742	0.8326	0.7619	0.5156
	Calibration	39	0.7600	1.000	0.8462	0.7294
	Validation	39	0.8500	0.9474	0.8974	0.7995
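The MCC values in Table 2 can be summarized with a few lines of code. The sketch below copies the MCC column of Table 2 into a dictionary (the variable names are hypothetical) and reports, for each set, the range and mean over the five splits, together with the split that has the highest calibration MCC.

```python
from statistics import mean

# MCC values copied from Table 2, keyed by split number.
mcc = {
    1: {"training": 0.5578, "calibration": 0.6868, "validation": 0.6244},
    2: {"training": 0.5059, "calibration": 0.8849, "validation": 0.8898},
    3: {"training": 0.5248, "calibration": 0.7097, "validation": 0.8112},
    4: {"training": 0.5221, "calibration": 0.8977, "validation": 0.6574},
    5: {"training": 0.5156, "calibration": 0.7294, "validation": 0.7995},
}

summary = {}
for subset in ("training", "calibration", "validation"):
    values = [mcc[split][subset] for split in mcc]
    summary[subset] = (min(values), max(values), mean(values))
    print(f"{subset}: min={min(values):.4f} max={max(values):.4f} "
          f"mean={mean(values):.4f}")

# Split with the best calibration-set MCC.
best_split = max(mcc, key=lambda split: mcc[split]["calibration"])
print("best calibration split:", best_split)
```

The summary makes the split-to-split dispersion explicit: the calibration and validation MCC stay above 0.6 for every split, while the training-set MCC stays below that level.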

Table 3

The statistical quality of models for chromosome aberration test suggested in the literature.

Reference	Set	n	Sensitivity	Specificity	Accuracy
Multicase methodology Rothfuss et al. (2006)	Training	537	0.528ᵃ	0.75ᵃ	0.649ᵃ
	Internal Validation	53	0.568ᵃ	0.717ᵃ	0.651ᵃ
Machine learning Rothfuss et al. (2006)	Training	521	0.751ᵇ	0.768ᵇ	0.76ᵇ
	Validation	58	0.708ᵇ	0.714ᵇ	0.716ᵇ
Rosenkranz (2004)	Dataset in 9 cross-validation folds	190	0.54	0.70	0.62
(KNN) Serra et al. (2003)	Training	346	0.693	0.861	0.812
	Validation	37	0.727	0.923	0.865
(SVM) Serra et al. (2003)	Training	308	0.989	1	0.997
	Cross-validation	38	0.727	1	0.921
	Validation	37	0.727	0.885	0.838
Estrada and Molina (2006)	Training	216	0.849	0.869	0.86
	Validation	156	0.818	0.829	0.828

ᵃ Mean value of 10 independent validations.

ᵇ Values represent mean ± standard deviation of 20 independent validations.

The basic hypothesis of the CORAL software is that “the good statistical quality of a model for calibration set should be accompanied by the good statistical quality of the model for external validation set”. According to this conception, the best CORAL model is observed for split #4 (calibration MCC = 0.8977). For the other splits, the MCC for the calibration and validation sets is also quite satisfactory, with values larger than 0.6. The fluctuations between splits are due to the relatively limited number of chemicals: in these circumstances, only a few substances that are false positives or false negatives in one split or another have a high impact on the statistical values. Nevertheless, the five splits provide a realistic picture of the results that can be expected in different cases. Overall, the data indicate that the values are good for all criteria examined here. The statistical parameters of other models published in the literature are quite similar to those we obtained. The best published model (Rothfuss et al., 2006) gave a sensitivity of 0.75 for the training set; the CORAL model gives similar quality (0.76), and its specificity is higher. The model by Rosenkranz (2004) has low statistical quality, quite similar to that of the model developed by Rothfuss et al. (2006) with the Multicase methodology. In addition, CORAL shows better predictive potential for the validation set than the model by Estrada and Molina (2006). The support vector machine described by Serra et al. (2003) gives poorer predictions than CORAL. Thus, the CORAL software gives useful predictions for the examined endpoint.

Conclusions

The suggested models are built according to the OECD principles. The statistical quality of the models is comparable with that of similar models suggested in the literature. Semi-correlation is a special category used in the CORAL software to build binary classifications of the form Yes/No or Active/Inactive. To our knowledge, the semi-correlation approach has no direct analogues. However, there have been successful attempts to use the approach as a tool for SAR analysis (Toropov et al., 2012b, Toropov et al., 2012c, Toropova and Toropov, 2017). The principle “QSAR is a random event” is confirmed for the semi-correlations developed for different splits into the training and validation sets (Table 2). In other words, the semi-correlations have predictive potential for all splits, but the statistical characteristics vary between splits: there are lucky splits (e.g. #2) and unlucky splits (e.g. #1).

References (40 in total)

1. J R Serra; E D Thompson; P C Jurs. Development of binary classification of structural chromosome aberrations for a diverse set of organic compounds from molecular structure. Chem Res Toxicol, 2003-02.

2. Herbert S Rosenkranz. SAR modeling of genotoxic phenomena: the consequence on predictive performance of deviation from a unity ratio of genotoxicants/non-genotoxicants. Mutat Res, 2004-04-11.

3. Joseph R Votano. Recent uses of topological indices in the development of in silico ADMET models. Curr Opin Drug Discov Devel, 2005-01.

4. Andreas Rothfuss; Thomas Steger-Hartmann; Nikolaus Heinrich; Jörg Wichard. Computational prediction of the chromosome-damaging potential of chemicals. Chem Res Toxicol, 2006-10.

5. Mauro A A Castro; Tor G H Onsten; José C F Moreira; Rita M C de Almeida. Chromosome aberrations in solid tumors have a stochastic nature. Mutat Res, 2006-06-30.

6. David Jacobson-Kram; Joseph F Contrera. Genetic toxicity assessment: employing the best science for human safety evaluation. Part I: Early screening for potential human mutagens. Toxicol Sci, 2006-12-28.

7. A D Watters; J J Going; T G Cooke; J M S Bartlett. Chromosome 17 aneusomy is associated with poor prognostic factors in invasive breast carcinoma. Breast Cancer Res Treat, 2003-01.

8. Paolo Boffetta; Olga van der Hel; Hannu Norppa; Eleonora Fabianova; Aleksandra Fucic; Sarolta Gundy; Juozas Lazutka; Antonina Cebulska-Wasilewska; Daniela Puskailerova; Ariana Znaor; Zsolt Kelecsenyi; Juozas Kurtinaitis; Jadwiga Rachtan; Alessandra Forni; Roel Vermeulen; Stefano Bonassi. Chromosomal aberrations and cancer risk: results of a cohort study from Central Europe. Am J Epidemiol, 2006-10-27.

9. Ernesto Estrada; Enrique Molina. Automatic extraction of structural alerts for predicting chromosome aberrations of organic compounds. J Mol Graph Model, 2006-02-17.

10. Karin Rennstam; Minna Ahlstedt-Soini; Bo Baldetorp; Pär-Ola Bendahl; Ake Borg; Ritva Karhu; Minna Tanner; Mika Tirkkonen; Jorma Isola. Patterns of chromosomal imbalances defines subgroups of breast cancer with distinct clinical features and prognosis. A study of 305 tumors by comparative genomic hybridization. Cancer Res, 2003-12-15.
