
Application of different chemometric tools in QSAR study of azolo-adamantanes against influenza A virus.

Abstract

Quantitative relationships between molecular structure and the anti-influenza A activity of azolo-adamantane derivatives were established using different chemometric tools, including factor analysis-based multiple linear regression (FA-MLR), principal component regression analysis (PCRA), and genetic algorithm-partial least squares (GA-PLS). The FA-MLR model describes the effect of geometrical and quantum indices on the enzyme inhibitory activity of the studied molecules. The quality of the PCRA equation was found to be better than that of the equation derived from FA-MLR. GA-PLS analysis indicated that the topological (IC4 and MPC06), constitutional (nf), and geometrical (G(N..S)) parameters had the most significant influence on anti-influenza A activity. Comparison of the statistical methods showed that GA-PLS gave superior results and could explain and predict 85% and 77% of the variance in the pIC50 data, respectively.


Keywords: Azolo-adamantanes; FA-MLR; GA-PLS; Influenza A; PCRA; QSAR

Year: 2011 PMID: 22049275 PMCID: PMC3203269

Source DB: PubMed Journal: Res Pharm Sci ISSN: 1735-5362

INTRODUCTION

Synthesis and biological evaluation of new compounds usually consume a great deal of time and money. The application of computational methods to the design of biologically active compounds has opened a new window in modern drug discovery research. Computational methods can accelerate drug discovery by designing new compounds and predicting their potency or activity. Quantitative structure-activity relationship (QSAR) studies, one of the most important areas in chemometrics, play a fundamental role in predicting the biological activity of new compounds and identifying ligand-receptor interactions (1–5). QSAR models are mathematical equations that provide deeper insight into the mechanism of biological activity by relating chemical structures to biological activities. The most important step in building QSAR models is the appropriate representation of the structural and physicochemical features of chemical entities (6–9). These features, which are defined as molecular descriptors, are the ones with the greatest impact on the biological activity of interest (10–13). Molecular descriptors have been classified into different categories, including physicochemical, constitutional, geometrical, topological, and quantum chemical descriptors. Dragon and Gaussian are two well-known software packages that can provide more than 1000 of these descriptors (14,15). The first step in constructing QSAR/QSPR models is the selection of molecular descriptors that numerically represent the variation in the property of interest. The selected descriptors are then used to construct statistical models. There are two types of QSAR/QSPR models: regression models and classification models. Multiple linear regression (MLR), principal component regression (PCR), and partial least squares (PLS) are regression models.
Although MLR equations can describe structure-property relationships appropriately, some information is disregarded in MLR analysis. Because of the collinearity problem in MLR, collinear descriptors may have to be removed before the model is developed. There are several variable selection methods, including forward, backward, and stepwise selection. There are also nature-inspired methods, of which the genetic algorithm is the most widely used. Factor analysis identifies the important predictor variables contributing to the response variable and avoids collinearities among them. PLS, as a factor analysis-based method, circumvents the multicollinearity problem in the descriptors. In this method, the descriptor data matrix is decomposed into orthogonal matrices with an inner relationship between the dependent and independent variables. Because a minimal number of latent variables is used for modeling, PLS copes with noisy data better than MLR (11–13).

Each winter, millions of people suffer from influenza, a highly contagious infection. Influenza virions are enveloped, mostly spherical particles with an outer lipid membrane. The genome of the influenza virus consists of eight separate segments of single-stranded negative-sense RNA associated with nucleoprotein and several molecules of the three subunits of its RNA polymerase. Unlike eukaryotic RNA polymerase, the viral polymerase complex lacks proofreading (error-correcting) activity. For this reason, like other RNA viruses, the influenza virus has a very high mutation rate in its genome, which leads to the fast selection of drug-resistant strains. Although numerous steps in the viral life cycle are potential targets for drug intervention, only two of them are currently exploited in clinical use. Currently, two main classes of chemical compounds are used for the treatment of influenza. They differ in their viral targets and mechanisms of action.
The antiviral drugs amantadine and rimantadine block a viral ion channel and prevent the virus from infecting cells. Oseltamivir and zanamivir are designed to halt the spread of the virus in the body (16).

In this study, structural invariants obtained from whole molecular structures and three chemometric methods were used to relate structural parameters to the anti-influenza A activity of azolo-adamantanes. These methods were partial least squares combined with genetic algorithm variable selection (GA-PLS), factor analysis-MLR (FA-MLR), and principal component regression analysis (PCRA).

MATERIALS AND METHODS

Software

A Pentium IV personal computer (CPU at 3.06 GHz) running the Windows XP operating system was used. Geometry optimization was performed with Hyperchem (version 7.0, Hypercube, Inc.). Dragon software was used to calculate constitutional, topological, geometrical, and functional group descriptors. Gaussian software was used to calculate quantum descriptors. SPSS software (version 11.50, IBM, Inc.) was used for PCR and FA-MLR analysis. GA-PLS regression and other calculations were performed in the MATLAB (version 7.1, MathWorks, Inc.) environment.

Activity data and descriptor generation

The biological data used in this study were the anti-influenza A activities (in terms of -log IC50) of a set of forty-six azolo-adamantane derivatives (16). The structural features and biological activities of these compounds are listed in Table 1; the activities were used as the dependent variable in the subsequent QSAR analysis. The two-dimensional structures of the molecules were drawn using Hyperchem 7.0. The final geometries were obtained with the semi-empirical AM1 method in Hyperchem. The molecular structures were optimized using the Polak-Ribiere algorithm until the root mean square gradient was 0.01 kcal mol⁻¹. Some chemical parameters, including molecular volume (V), molecular surface area (SA), hydrophobicity (Log P), hydration energy (HE), and molecular polarizability (MP), were calculated using Hyperchem. The geometry produced by Hyperchem was transferred into the Dragon program, which was developed by the Milano Chemometrics and QSAR Group (14). Functional group, topological, geometrical, and constitutional descriptors for each molecule were calculated with Dragon. Z-matrices of the structures were generated by Hyperchem and transferred to the Gaussian 98 program. Complete geometry optimization was performed taking the most extended conformation as the starting geometry. Semi-empirical molecular orbital calculation (AM1) of the structures was performed using Gaussian 98 (15). The Gaussian program calculated different quantum chemical descriptors, including dipole moment (DM), local charges, and HOMO and LUMO energies. Hardness (η), softness (S), electronegativity (χ), and electrophilicity (ω) were calculated according to the method proposed by Thanikaivelan and coworkers (17). The descriptors calculated from whole molecular structures are briefly described in Table 2.
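The global reactivity indices mentioned above can be illustrated with a minimal sketch. It assumes the commonly used frontier-orbital (Koopmans-type) approximations, η = (E_LUMO − E_HOMO)/2, χ = −(E_HOMO + E_LUMO)/2, S = 1/(2η), and ω = χ²/(2η); the exact conventions used in the study are not stated in this text (some authors define S = 1/η). The function name and the orbital energies are hypothetical.

```python
def reactivity_descriptors(e_homo, e_lumo):
    """Global reactivity indices from HOMO/LUMO energies (same unit, e.g. eV).

    Assumed conventions (not confirmed by the source text):
    hardness eta = (E_LUMO - E_HOMO) / 2
    electronegativity chi = -(E_HOMO + E_LUMO) / 2
    softness S = 1 / (2 * eta)
    electrophilicity omega = chi**2 / (2 * eta)
    """
    hardness = (e_lumo - e_homo) / 2.0
    electronegativity = -(e_homo + e_lumo) / 2.0
    softness = 1.0 / (2.0 * hardness)
    electrophilicity = electronegativity ** 2 / (2.0 * hardness)
    return {
        "hardness": hardness,
        "electronegativity": electronegativity,
        "softness": softness,
        "electrophilicity": electrophilicity,
    }


# Hypothetical orbital energies in eV; these are not values from the study.
descriptors = reactivity_descriptors(e_homo=-9.0, e_lumo=1.0)
for name, value in descriptors.items():
    print(f"{name}: {value:.3f}")
```

All four indices depend only on the two frontier orbital energies, so they are strongly inter-correlated; this is one reason a variable selection or latent-variable method is needed downstream.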

Table 1

Chemical structures of azolo-adamantanes analogues used in this study and their experimental activity against influenza A virus

Table 2

Brief description of some descriptors used in this study


Data Pretreatment and model building

Anti-influenza A activity was used as the dependent variable. The calculated descriptors (independent variables) were collected in a data matrix whose rows and columns corresponded to the molecules and descriptors, respectively. To test the performance of the final models, about 17% of the data (8 of 46 molecules) were selected as external test set molecules. These samples were selected based on descriptor space: the data matrix containing all descriptors was subjected to principal component analysis, and the first two principal components were plotted against each other. GA-PLS, MLR with factor analysis as the pre-processing step for variable selection (FA-MLR), and PCRA were used to derive the QSAR equations.

RESULTS

GA-PLS

In this study, GA-PLS was employed to model the structure-activity relationships of the azolo-adamantanes more appropriately (18,19). The PLS method allows the construction of larger QSAR equations while still avoiding over-fitting and eliminating most variables. It is normally used in combination with cross-validation to obtain the optimum number of components (20,21). The PLS regression method used was the NIPALS-based algorithm available in the chemometrics toolbox of MATLAB (version 7.1, MathWorks, Inc.). Leave-one-out cross-validation was used to obtain the optimum number of factors based on the Haaland and Thomas F-ratio criterion (22). The genetic algorithm is a simple optimization method based on the evolution of living beings; its simplicity and effectiveness have led to its application to many types of optimization problems in many scientific fields. It uses genetic rules such as reproduction, crossover, and mutation to build pseudo-organisms that are then selected, on the basis of a fitness criterion, to survive and pass information on to the next generation (23–25). Each individual of the population was defined by a chromosome of binary values representing a subset of descriptors. The population size varied between 50 and 250 for different GA runs. The population of the first generation was selected randomly. The number of genes in each chromosome was equal to the number of descriptors (26). A gene took a value of 1 if its corresponding descriptor was included in the subset; otherwise, it took a value of 0. The number of genes with a value of 1 was kept relatively low to obtain a small subset of descriptors; that is, the probability of generating 0 for a gene was set higher (at least 70%) than that of generating 1 (25). The operators used here were crossover and mutation.
The probability of applying these operators was varied linearly with generation renewal (0-10% for mutation and 60-90% for crossover). For a typical run, evolution was stopped when 90% of the generation had the same fitness. A maximum of 500 generations was used throughout. The fitness function (predictability of the model) was computed by cross-validation from the sum of squared errors (SSECV); the inverse of SSECV was taken as the fitness (27). The chromosomes with the fewest selected descriptors and the highest fitness were marked as informative chromosomes (26). In PLS analysis, the descriptor data matrix is decomposed into orthogonal matrices with an inner relationship between the dependent and independent variables. The multicollinearity problem in the descriptors is circumvented because a minimal number of latent variables is used for modeling (26). Since redundant variables degrade the performance of PLS, as they do for other regression methods, a variable selection method must be employed to find the most suitable set of descriptors. Here, GA was used for variable selection. The data set (n=46) was divided into two groups: a calibration set (n=38) and a prediction set (n=8). The prediction samples were selected based on descriptor space: the data matrix containing all descriptors was subjected to principal component analysis, and the first two principal components were plotted against each other. For the 38 calibration samples, cross-validation was used to find the optimum number of latent variables for each PLS model. GA produces a population of acceptable models in each run. In this work, many GA-PLS runs were conducted using different initial population sizes (50-250), and a large number of acceptable models were therefore created.
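The GA settings described above (binary chromosomes biased toward 0, linearly varying operator probabilities, fitness as the inverse of SSECV, and the 90% convergence rule) can be sketched as follows. This is an illustrative outline only, not the MATLAB code used in the study: all function names are hypothetical, the direction of the linear schedule (from the first to the second quoted value) is an assumption, and the cross-validated PLS step that would supply SSECV is left out.

```python
import random
from collections import Counter


def init_population(pop_size, n_descriptors, p_zero=0.7, rng=None):
    """Random binary chromosomes; each gene is 0 with probability p_zero."""
    rng = rng or random.Random()
    return [[0 if rng.random() < p_zero else 1 for _ in range(n_descriptors)]
            for _ in range(pop_size)]


def operator_probability(generation, max_generation, start, end):
    """Linear schedule for an operator probability (direction is assumed)."""
    return start + (end - start) * generation / max_generation


def fitness(ssecv):
    """Fitness taken as the inverse of the cross-validated sum of squared errors."""
    return 1.0 / ssecv


def has_converged(fitness_values, fraction=0.9):
    """True when at least `fraction` of the population shares one fitness value."""
    most_common_count = Counter(fitness_values).most_common(1)[0][1]
    return most_common_count >= fraction * len(fitness_values)


# Hypothetical run parameters: 50 chromosomes over 20 candidate descriptors.
population = init_population(pop_size=50, n_descriptors=20, rng=random.Random(0))

# Operator probabilities halfway through a 500-generation run.
p_mut = operator_probability(250, 500, 0.00, 0.10)
p_cross = operator_probability(250, 500, 0.60, 0.90)
print(f"mutation={p_mut:.3f}, crossover={p_cross:.3f}")
```

In a full implementation, each chromosome would select descriptor columns, a PLS model would be cross-validated on them to obtain SSECV, and `fitness` would rank the chromosomes for selection.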
The most suitable GA-PLS model, which gave the best fitness, contained 10 descriptors: four topological indices (PW2, SIC2, IC4, and MPC06), one constitutional index (nf), two geometrical indices (G(N..S) and PJI3), and three quantum parameters (LUMO, DMz, and DMx). The PLS estimates of the regression coefficients are shown in Fig. 1. Since these coefficients were calculated from normalized descriptor values, they can be used as a measure of the importance of the corresponding descriptors. The topological (IC4 and MPC06), constitutional (nf), and geometrical (G(N..S)) parameters make the most significant contributions to the QSAR model, followed by the geometrical and topological parameters PJI3 and SIC2.

Fig. 1

PLS regression coefficients for the variables used in GA-PLS model

The statistical parameters of the resulting PLS-based QSAR model are given in Table 3. This GA-PLS model possessed high statistical quality (R2 = 0.86 and Q2 = 0.77); that is, it could explain about 86% and predict about 77% of the variance in the anti-influenza A activity of the studied molecules. The predictive ability of the model was assessed by applying it to the 8 external test set molecules. The correlation coefficient for the prediction set was 0.85, which means that the resulting QSAR model could predict 85% of the variance in the inhibitory activity data; the standard error of prediction was 0.13.
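The difference between the fitted R2 and the leave-one-out cross-validated Q2 can be made concrete with a small sketch. It assumes the common definitions R2 = 1 − SS_res/SS_tot and Q2 = 1 − PRESS/SS_tot, and it uses a single-descriptor straight-line fit as a stand-in for the multi-descriptor PLS model of the study; the data and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_and_q2(x, y):
    """Fitted R2 and leave-one-out cross-validated Q2."""
    n = len(y)
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)

    # R2: residuals of the model fitted on all samples.
    slope, intercept = fit_line(x, y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

    # Q2: each sample is predicted by a model fitted without it (PRESS).
    press = 0.0
    for i in range(n):
        s, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (s * x[i] + b)) ** 2

    return 1.0 - ss_res / ss_tot, 1.0 - press / ss_tot


# Hypothetical descriptor values and activities, not data from the study.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
r2, q2 = r2_and_q2(x, y)
print(f"R2={r2:.4f}, Q2={q2:.4f}")
```

For a least-squares fit, each left-out residual is at least as large in magnitude as the corresponding fitted residual, so Q2 cannot exceed R2; a large gap between the two is a common sign of over-fitting.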

Table 3

Statistical parameters for testing prediction ability of the FA-MLR, PCR and GA-PLS models

The predicted activities are given in Table 1 and are plotted against the corresponding experimental values in Fig. 2. Comparison of GA-PLS with the other regression methods indicates the higher accuracy of this method in describing the anti-influenza A activity of the azolo-adamantane derivatives. The difference in accuracy among the regression methods used in this study is visualized in Fig. 2 by plotting the predicted activity (by cross-validation) against the experimental values. The GA-PLS plot shows the least scatter of data around a straight line, and the PCRA plot ranks second in accuracy.

Fig. 2

Plots of the cross-validated predicted activity against the experimental activity for the different models obtained against influenza A

Some criteria for the predictive ability of a model have been suggested by Tropsha. If these criteria are satisfied, the model can be considered predictive. In these criteria, R2 is the correlation coefficient of the regression between the predicted and observed activities of the compounds in the training and test sets, Ro2 is the correlation coefficient for the regression of predicted versus observed activities through the origin, R’o2 is the correlation coefficient for the regression of observed versus predicted activities through the origin, and k and k’ are the slopes of the respective regression lines through the origin. Detailed definitions of Ro2, R’o2, k, and k’ are given in the literature. In addition, according to Roy and coworkers, it is necessary to examine the difference between the values of Ro2 and R’o2. They suggested a modified form of R2, denoted R2m; an R2m value >0.5 indicates good external predictability of the developed model (28).
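The criteria themselves are not reproduced in this text. For orientation, the form in which the Golbraikh-Tropsha conditions and the modified R2 of Roy and coworkers are commonly stated in the QSAR literature is given below; whether the study applied exactly these thresholds is an assumption.

```latex
\[
Q^2 > 0.5, \qquad R^2 > 0.6,
\]
\[
\frac{R^2 - R_0^2}{R^2} < 0.1 \quad \text{or} \quad \frac{R^2 - R_0'^2}{R^2} < 0.1,
\]
\[
0.85 \le k \le 1.15 \quad \text{or} \quad 0.85 \le k' \le 1.15,
\]
\[
R_m^2 = R^2 \left( 1 - \sqrt{\left| R^2 - R_0^2 \right|} \right).
\]
```

Here Q2 is the cross-validated correlation coefficient of the training set, and the remaining symbols are as defined in the preceding paragraph.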

Robustness and applicability domain of the models

The leverage is one of the standard measures for assessing the applicability domain of a model. The numerical value of the leverage has certain properties: (a) the value is always greater than zero; (b) the lower the value, the higher the confidence in the prediction (a value of 1 indicates very poor prediction, whereas a value of 0 indicates perfect prediction and is usually not achievable); (c) if there are P coefficients in the model, the leverages of the calibration points sum to P. The warning leverage (h*) is another criterion for interpreting the results. It is generally fixed at 3k/n, where n is the number of training compounds and k is the number of model parameters. A leverage greater than h* means that the predicted response is the result of substantial extrapolation of the model and may therefore be unreliable (29). The calculated leverage values of the test set samples for the different MLR and PCR models are listed in Table 5. The warning leverage, the threshold value for an accepted prediction, is also given in Table 5. The leverages of all test samples are lower than h* for all models, which means that all predicted values are acceptable.
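The leverage check can be sketched as follows, assuming the usual definition h = xᵀ(XᵀX)⁻¹x, where X is the calibration matrix of model variables and x is the descriptor vector of the compound being predicted. The function names and the small data set are hypothetical, and a plain Gauss-Jordan solver stands in for a linear algebra library.

```python
def solve(a, b):
    """Solve a*z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverage(x_train, x_query):
    """h = x^T (X^T X)^-1 x for one query vector."""
    k = len(x_train[0])
    xtx = [[sum(row[i] * row[j] for row in x_train) for j in range(k)]
           for i in range(k)]
    z = solve(xtx, x_query)  # z = (X^T X)^-1 x
    return sum(q * zi for q, zi in zip(x_query, z))


def warning_leverage(n_train, k_params):
    """Warning leverage h* = 3k/n, as defined in the text."""
    return 3.0 * k_params / n_train


# Hypothetical calibration matrix: intercept column plus one descriptor.
x_train = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
h_values = [leverage(x_train, row) for row in x_train]
h_star = warning_leverage(n_train=38, k_params=3)  # illustrative n and k
print([round(h, 3) for h in h_values], round(h_star, 3))
```

The calibration leverages sum to the number of columns of X, in line with property (c) above; a test compound whose leverage exceeds h* would be flagged as lying outside the applicability domain.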

Table 4

Statistical parameters obtained for the developed model of the investigated compounds

Table 5

Leverage (h) of the external test set molecules for different models. The last row (h*) is the warning leverage.


FA-MLR and PCRA

FA-MLR was performed on the data set. Factor analysis (FA) was used to reduce the number of variables and to detect structure in the relationships among them. This pre-processing step is applied to identify the important predictor variables and to avoid collinearities (30). PCRA was also applied to the data set. With PCRA, collinearities among the X variables are not a disturbing factor, and the number of variables included in the analysis may exceed the number of observations (31). In this method, the factor scores obtained from FA are used as the predictor variables (30). In PCRA, all descriptors are assumed to be important, whereas the aim of factor analysis is to identify the relevant descriptors. Table 6 shows the loadings of the variables on the four factors (after VARIMAX rotation) for the compounds tested against influenza A. About 73% of the variance in the original data matrix could be explained by the four selected factors.

Table 6

Numerical values of factor loading numbers 1-4 for descriptors after VARIMAX rotation

Based on the procedure explained in the experimental section, a three-parameter equation (Equation 1) was derived. Equation 1 could explain about 72% and predict 64% of the variance in the pIC50 data. This equation describes the effect of geometrical (G(N..S) and PJI3) and quantum (DMz) indices on the enzyme inhibitory activity of the studied molecules. When the factor scores were used as the predictor variables in a multiple regression equation with forward selection (PCRA), Equation 2 was obtained. Equation 2 could explain and predict 85% and 82% of the variance in the pIC50 data, respectively. Since factor scores are used instead of selected descriptors, and each factor score contains information from different descriptors, loss of information is avoided and the quality of the PCRA equation is better than that of the FA-MLR equation (32). As seen in Table 6, for each factor the loading values of some descriptors are much higher than those of the others. These high values indicate that the factor contains more information about those descriptors. All factors carry information from all descriptors, but the contributions of a descriptor to different factors are not equal. For example, factors 1 and 2 have higher loadings for the geometrical, topological, and constitutional indices, whereas information about the topological, geometrical, and quantum descriptors is mostly incorporated in factors 3 and 4. Therefore, the significance of the original variables for modeling the activity can be obtained from the factor scores used in Equation 2. Factor score 1 indicates the importance of G(N..S) (a geometrical index).
Factor score 2 indicates the importance of nf and PW2 (constitutional and topological descriptors), and factor scores 3 and 4 signify the importance of MPC06, PJI3, and DMz (topological, geometrical, and quantum descriptors). The predicted activities for the calibration set (by cross-validation) and the prediction set for FA-MLR and PCRA are listed in Table 1 and are plotted against the corresponding experimental values in Fig. 2. The statistical parameters of the prediction set are listed in Table 3. The correlation coefficient of prediction for FA-MLR is 0.78, which means that the QSAR model could predict 78% of the variance in the anti-influenza A activity data; its root mean square error is 0.19. The correlation coefficient of prediction for PCRA is 0.82, which means that the derived QSAR model could predict 82% of the variance in the inhibitory activity data; its root mean square error was 0.12. Although this analysis shows acceptable prediction, the predicted values of some molecules are close to each other.

DISCUSSION

Quantitative relationships between molecular structure and anti-influenza activity were established by GA-PLS, FA-MLR, and PCRA. As shown in Fig. 1, the topological (IC4 and MPC06), constitutional (nf), and geometrical (G(N..S)) parameters make the most significant contributions to the GA-PLS model, followed by the geometrical and topological parameters PJI3 and SIC2. FA-MLR was performed on the data set; Equation 1 describes the effect of geometrical (G(N..S) and PJI3) and quantum (DMz) indices on the enzyme inhibitory activity of the examined molecules. PCRA was also performed on the data set; Equation 2 could explain and predict 85% and 82% of the variance in the pIC50 data, respectively.

CONCLUSION

Quantitative relationships between molecular structure and the anti-influenza A activity of a series of azolo-adamantane derivatives were established using different chemometric tools, including FA-MLR, PCRA, and GA-PLS. The FA-MLR model describes the effect of geometrical and quantum indices on the inhibitory activity of the examined molecules. The quality of the PCRA equation is better than that of the FA-MLR equation. Factor scores 1 and 2 indicate the importance of geometrical, constitutional, and topological indices. Factor scores 3 and 4 show the importance of geometrical, topological, and quantum descriptors. GA-PLS analysis indicated that the topological (IC4 and MPC06), constitutional (nf), and geometrical (G(N..S)) parameters were the most significant for inhibitory activity. A comparison of the statistical methods showed that GA-PLS gave superior results and could explain and predict 85% and 77% of the variance in the pIC50 data, respectively.

14 in total

1. Comparative QSAR: Toward a Deeper Understanding of Chemicobiological Interactions.

Authors: Corwin Hansch; David Hoekman; Hua Gao
Journal: Chem Rev Date: 1996-05-09 Impact factor: 60.622

2. Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. 2. Application of the novel 3D molecular descriptors to QSAR/QSPR studies.

Authors: Viviana Consonni; Roberto Todeschini; Manuela Pavan; Paola Gramatica
Journal: J Chem Inf Comput Sci Date: 2002 May-Jun

3. Genetic Algorithm guided Selection: variable selection and subset selection.

Authors: Sung Jin Cho; Mark A Hermsmeier
Journal: J Chem Inf Comput Sci Date: 2002 Jul-Aug

4. A novel subshape molecular descriptor.

Authors: Santosh Putta; John Eksterowicz; Christian Lemmen; Robert Stanton
Journal: J Chem Inf Comput Sci Date: 2003 Sep-Oct

5. Effect of the electronic and physicochemical parameters on the carcinogenesis activity of some sulfa drugs using QSAR analysis based on genetic-MLR and genetic-PLS.

Authors: Omar Deeb; Bahram Hemmateenejad; Amal Jaber; R Garduno-Juarez; Ramin Miri
Journal: Chemosphere Date: 2007-02-20 Impact factor: 7.086

6. Quantitative structure-activity relationship for cyclic imide derivatives of protoporphyrinogen oxidase inhibitors: a study of quantum chemical descriptors from density functional theory.

Authors: Jian Wan; Li Zhang; Guangfu Yang; Chang-Guo Zhan
Journal: J Chem Inf Comput Sci Date: 2004 Nov-Dec

7. Synthesis and anti-viral activity of azolo-adamantanes against influenza A virus.

Authors: Vladimir V Zarubaev; Efim L Golod; Pavel M Anfimov; Anna A Shtro; Victor V Saraev; Alexey S Gavrilov; Alexander V Logvinov; Oleg I Kiselev
Journal: Bioorg Med Chem Date: 2009-11-29 Impact factor: 3.641

8. Application of ab initio theory to QSAR study of 1,4-dihydropyridine-based calcium channel blockers using GA-MLR and PC-GA-ANN procedures.

Authors: Bahram Hemmateenejad; Mohammad A Safarpour; Ramin Miri; Fariba Taghavi
Journal: J Comput Chem Date: 2004-09 Impact factor: 3.376

9. Quantitative structure-activity relationship study of recently synthesized 1,4-dihydropyridine calcium channel antagonists. Application of the Hansch analysis method.

Authors: Bahram Hemmateenejad; Ramin Miri; Morteza Akhond; Mojtaba Shamsipur
Journal: Arch Pharm (Weinheim) Date: 2002-12 Impact factor: 3.751

10. QSAR study of antimicrobial 3-hydroxypyridine-4-one and 3-hydroxypyran-4-one derivatives using different chemometric tools.

Authors: Razieh Sabet; Afshin Fassihi
Journal: Int J Mol Sci Date: 2008-12-02 Impact factor: 6.208
