L Saghaie, M Shahlaei, A Fassihi
Department of Medicinal Chemistry and Isfahan Pharmaceutical Research Center, School of Pharmacy and Pharmaceutical Sciences, Isfahan University of Medical Sciences, Isfahan, I.R. Iran.
Abstract
Quantitative relationships between the structures of twenty-six 2-mercaptoimidazoles and their activities as C-C chemokine receptor type 2 (CCR2) inhibitors were assessed. The biological activities of the compounds were modeled as a function of molecular structure by means of genetic algorithm multivariate linear regression (GA-MLR) and genetic algorithm artificial neural network (GA-ANN) methods. The pIC50 values calculated by GA-ANN are in good agreement with the experimental data, and the performance of the artificial neural network regression model is superior to that of the multivariate linear regression (MLR) model. From these results it can be deduced that there is a non-linear relationship between the pIC50 values and the calculated structural descriptors of the 2-mercaptoimidazoles. The GA-MLR and GA-ANN models described about 78% and 93% of the variance in the experimental activity of the training-set molecules, respectively. The study provides a novel and effective approach for predicting the biological activities of 2-mercaptoimidazole derivatives as CCR2 inhibitors and shows that a genetic algorithm combined with an ANN can be used as a powerful chemometric tool for quantitative structure-activity relationship (QSAR) studies.
Chemokines, or chemotactic cytokines, are a large group of small (~8-15 kDa) proteins that are structurally and functionally related and play a significant role in leukocyte migration and activation (1–5). Chemokines exert their effects by activating specific cell-surface proteins that belong to the well-known family of seven-transmembrane G-protein coupled receptors (GPCRs). Monocyte chemoattractant protein-1 (MCP-1/CCL2) is a member of the CC chemokine subgroup that binds to the CC chemokine receptor 2 (CCR2), which is expressed on the majority of blood-borne monocytes (6). Disruption of the MCP-1/CCR2 pathway in rodent models of inflammatory and autoimmune diseases, by genetic deletion of either MCP-1 (7) or CCR2 (8–10) or by the use of peptidyl CCR2 antagonists (11) or anti-MCP-1 antibodies (12), suggests that inhibition of CCR2 may provide therapies for a variety of diseases, including rheumatoid arthritis (11,12), multiple sclerosis (13–15) and atherosclerosis (10,16–18). These prospects have stimulated the search for small-molecule MCP-1/CCR2 antagonists in a large number of research laboratories (19).
In recent years, quantitative structure-activity relationships (QSARs) have been widely employed to build models that calculate and predict the biological or toxicological properties of drug candidates using computational descriptors extracted solely from molecular structure.
Artificial neural networks (ANNs) were first proposed by McCulloch and Pitts (20) as a data-mining technique in which information-processing units (neurons), organized in layers, act as the centers of data analysis. An ANN is a tool for processing input information that takes the structure and function of the human brain as its template. The human central nervous system consists of neurons interconnected by synapses, and information is transferred between these neurons by a series of action potentials (21). ANN algorithms offer advantages such as adaptive learning, parallel distributed processing and good generalization to unseen data.
ANNs have several benefits over traditional regression methods, since they require only a known input data set and make no prior assumptions about the form of the relationship (22). An ANN generates a mapping between the input and output variables, which can subsequently be used to predict the desired output as a function of suitable inputs (23). A multilayer neural network can approximate any smooth, measurable relationship between input and output vectors by optimizing a suitable set of connection weights and transfer functions (22). ANN models can therefore describe nonlinear relationships between the calculated descriptors of drug-like compounds and their bioactivities (24,25), and a non-parametric method such as feed-forward back-propagation neural network QSAR modeling is well suited to characterizing such a relationship (26).
Here we describe multiple linear regression (MLR) as a linear method and back-propagation ANN as a nonlinear technique for investigating the relationship between the structure and the CCR2 antagonist activity of a series of 2-mercaptoimidazoles. We further compare the two methods to assess their efficacy in modeling the inhibitory activity of the studied compounds.
MATERIALS AND METHODS
Computer hardware, software and preparation of data set
All calculations were run on a desktop computer running the Windows XP operating system. Bioactivities of 26 C-C chemokine receptor type 2 (CCR2) antagonists were taken from the literature (27) and are presented in Table 1. These values were converted from IC50 to pIC50 (the negative logarithm of IC50). The two-dimensional structures of the molecules were drawn with Hyperchem 7.0 software. The final conformations were calculated with the semi-empirical AM1 method.
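The IC50 to pIC50 conversion mentioned above can be sketched as follows. The source does not state the concentration unit of the IC50 values, so the unit handling here (and the function name `pic50_from_ic50`) is an assumption made for illustration; pIC50 is conventionally computed from IC50 expressed in mol/L.

```python
import math

# Hypothetical helper: the unit table is an assumption, since the paper
# does not say in which unit the literature IC50 values were reported.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

def pic50_from_ic50(ic50: float, unit: str = "nM") -> float:
    """Return pIC50 = -log10(IC50), with IC50 first converted to mol/L."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])

# An IC50 of 100 nM is 1e-7 mol/L, which corresponds to a pIC50 of about 7.
print(round(pic50_from_ic50(100.0, "nM"), 3))
```

Working on the logarithmic scale keeps the dependent variable roughly linear in free-energy terms, which is why QSAR models are usually fitted to pIC50 rather than to IC50 itself.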
Table 1
Structures of some 2-mercaptoimidazoles as CCR-2 Inhibitors used in this study
The molecular structures were optimized using the Polak-Ribiere algorithm until the root mean square gradient was 0.01 kcal mol⁻¹. The z-matrices of the structures were generated by Hyperchem and transferred to the Gaussian 98 program (28). Full conformational optimization was carried out taking the most extended conformation as the starting geometry. The semi-empirical molecular orbital calculation (AM1) was then performed again in Gaussian 98 to avoid trapping in local minima. The resulting conformations were transferred to the Dragon program package, developed by the Milano Chemometrics and QSAR Group (29). Dragon was used to calculate a large number of descriptors, including geometrical, topological, functional group and constitutional descriptors. The names and numbers of the calculated descriptors are listed in Table 2. In the preprocessing step that followed, descriptors with constant values for all studied molecules were identified and deleted from the data matrix.
Table 2
Short description of some descriptors used in this study including their name and number.
To reduce the redundancy among the calculated descriptors, the correlations among descriptors and with the bioactivity of the molecules were examined, and collinear descriptors (i.e. R2 > 0.90) were detected. Among each set of collinear descriptors, the one with the highest correlation with bioactivity was kept for the model-building phase and the others were removed.
MATLAB software (version 7.1, MathWorks Inc.) was used to develop scripts for ANN regression modeling and model validation.
The data set was split into a training set and a prediction set using the Kennard and Stone algorithm (30). According to Tropsha, this algorithm yields the best models (31). The training set of 21 molecules was used to adjust the parameters of the QSAR models, and the test set of 5 compounds was used to assess their prediction capability.
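The descriptor pretreatment and the training/test split described above can be sketched in a few lines. This is a minimal illustration, not the authors' MATLAB scripts: the function names are hypothetical, the greedy collinearity filter is one reasonable reading of "keep the descriptor most correlated with activity", and the Kennard and Stone selection is written in its usual max-min Euclidean distance form.

```python
import numpy as np

def prune_descriptors(X, y, r2_max=0.90):
    """Drop constant columns, then keep one descriptor per collinear group.

    Within a group of descriptors with pairwise R2 > r2_max, the one with
    the highest squared correlation with the activity y is retained.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    varying = [j for j in range(X.shape[1]) if X[:, j].max() > X[:, j].min()]
    r2_y = {j: np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in varying}
    kept = []
    # Visit descriptors from most to least correlated with activity.
    for j in sorted(varying, key=lambda c: r2_y[c], reverse=True):
        if all(np.corrcoef(X[:, j], X[:, k])[0, 1] ** 2 <= r2_max for k in kept):
            kept.append(j)
    return sorted(kept)

def kennard_stone(X, n_train):
    """Return (train_idx, test_idx) chosen by max-min Euclidean distance."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Seed the training set with the two most distant samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Distance from each candidate to its nearest selected sample.
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(d_min))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Synthetic stand-in for a 26-molecule descriptor matrix (not the paper's data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(26, 4))
train_idx, test_idx = kennard_stone(X_demo, n_train=21)
print(len(train_idx), len(test_idx))
```

Because Kennard and Stone selection picks the most mutually distant samples first, the training set spans the descriptor space and the held-out molecules tend to lie inside it, which is the usual argument for using it in QSAR work.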
Feature selection using genetic algorithm
When the number of independent variables exceeds the number of investigated molecules, feature selection is necessary to avoid chance correlation and to select the most informative descriptors. However, selecting sufficient and informative descriptors for biological activity in QSAR studies is not easy, because there are no universal rules governing this selection. The genetic algorithm (GA) is one of the best methods for feature selection in model building. The GA used here has been described elsewhere (33) and is not presented here for brevity.
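Since the GA itself is not described in the text, the following is only a generic sketch of GA-based descriptor selection, not the procedure of reference (33). Everything in it is an assumption made for illustration: chromosomes are fixed-size subsets of descriptor indices, fitness is the leave-one-out Q2 of an ordinary least-squares model, the fitter half of each generation survives, and the names `loo_q2` and `ga_select` are hypothetical.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out Q2 of an ordinary least-squares model with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def ga_select(X, y, n_desc=3, pop_size=20, n_gen=15, p_mut=0.2, seed=0):
    """Search for the n_desc-descriptor subset with the highest LOO Q2."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    cache = {}

    def fitness(subset):
        if subset not in cache:
            cache[subset] = loo_q2(X[:, list(subset)], y)
        return cache[subset]

    # Each chromosome is a sorted tuple of distinct column indices.
    pop = [tuple(sorted(int(j) for j in rng.choice(n_feat, n_desc, replace=False)))
           for _ in range(pop_size)]
    for _ in range(n_gen):
        ranked = sorted(set(pop), key=fitness, reverse=True)
        parents = ranked[: max(2, pop_size // 2)]       # elitist survival
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.integers(len(parents), size=2)
            pool = list(set(parents[a]) | set(parents[b]))
            child = [int(c) for c in rng.choice(pool, n_desc, replace=False)]  # crossover
            if n_feat > n_desc and rng.random() < p_mut:                       # mutation
                unused = [j for j in range(n_feat) if j not in child]
                child[int(rng.integers(n_desc))] = int(rng.choice(unused))
            children.append(tuple(sorted(child)))
        pop = parents + children
    best = max(set(pop), key=fitness)
    return best, fitness(best)
```

Using a cross-validated statistic rather than the fitted R2 as the fitness keeps the search from rewarding subsets that merely overfit the 21 training molecules.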
Multiple linear regression
The general purpose of multivariate linear regression (MLR) is to quantify the relationship between several independent (predictor) variables and a dependent variable. The predictor variables can be various physicochemical descriptors of the molecules, their principal components or latent variables. Once the model has been built, the activity of each ligand is calculated from it. A multilinear model can be represented as:
\[ y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n + \beta \tag{1} \]
where n is the number of independent variables, a_0 is the intercept, a_1, ..., a_n denote the regression coefficients, β is the error and y is the dependent variable. The regression coefficients signify the independent contributions of each molecular descriptor.
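A model of the form of Equation (1) can be fitted by ordinary least squares. The sketch below uses NumPy's `lstsq`; the function names and the synthetic data are illustrative assumptions, not the paper's descriptors or code.

```python
import numpy as np

def fit_mlr(X, y):
    """Return [a0, a1, ..., an] minimizing the squared error of Equation (1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])   # leading column of ones gives a0
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Evaluate y = a0 + sum_k a_k * x_k for each row of X."""
    return coef[0] + np.asarray(X, dtype=float) @ coef[1:]

# Synthetic example: 21 "molecules", 3 descriptors, known coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(21, 3))
y_demo = 1.5 + X_demo @ np.array([2.0, -0.5, 0.25])
print(np.round(fit_mlr(X_demo, y_demo), 3))
```

With noise-free data the fitted coefficients reproduce the generating ones; with real activities the residual term β absorbs what the descriptors cannot explain.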
Artificial neural networks
A detailed explanation of the theory behind ANNs has been given elsewhere (34–36). The basic principle of supervised learning in an ANN is that it takes numerical inputs (the training data) and maps them onto desired outputs. The input and output neurons may be connected to the 'external world' and to other neurons inside the network. The way in which each neuron transforms its input depends on its adjustable weights and bias. The output of each neuron is the weighted sum of all its inputs plus the bias, usually passed through a nonlinear transfer function. For the present purposes, the great strength of ANNs arises from the fact that they can be trained.
Training is carried out by repeatedly presenting the network with known inputs and outputs and adjusting the connection weights and biases between the individual neurons. This procedure is repeated until the outputs of the network match the desired outputs to the required degree of accuracy. Training can be carried out using the back-propagation algorithm, in which the differences between the ANN output and the desired output are evaluated after each epoch. The changes in the values of the weights can be calculated using the following equation:
\[ \Delta w_{ij,n} = \eta \, \delta_i \, O_j + \alpha \, \Delta w_{ij,n-1} \tag{2} \]
where Δw_ij is the change in the value of the weight connecting neuron j to neuron i, the subscripts n and n-1 denote the current and previous updates, δ_i is the actual error of neuron i, and O_j is the output of neuron j. The coefficients η and α are the learning rate and the momentum factor, respectively. These coefficients control the speed and efficiency of the learning process and are optimized before the network is trained. An equation analogous to Equation (2) can be used for the bias settings.
An ANN can handle qualitative as well as quantitative inputs, and it does not require an explicit relationship between the inputs and the outputs. Whereas statistical analysis is limited to a known number of possible interactions, ANNs can examine more terms for interactions. In addition, because more information can be analyzed at the same time, more complicated and subtle interactions can be investigated with this method.
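The weight update with learning rate η and momentum α can be written compactly for a whole layer. The sketch assumes the standard back-propagation form, Δw_ij(n) = η δ_i O_j + α Δw_ij(n-1), which matches the symbol definitions given in the text; the function name and the example numbers are hypothetical.

```python
import numpy as np

def momentum_update(delta, outputs, prev_dw, eta=0.6, alpha=0.1):
    """Weight change for one layer under the assumed momentum rule.

    delta   : errors of the receiving neurons i, shape (n_i,)
    outputs : outputs O_j of the sending neurons j, shape (n_j,)
    prev_dw : weight change applied at the previous step, shape (n_i, n_j)
    """
    return eta * np.outer(delta, outputs) + alpha * np.asarray(prev_dw, dtype=float)

# Two receiving neurons, three sending neurons, no previous update.
delta = np.array([1.0, -2.0])
outputs = np.array([0.5, 1.0, 2.0])
dw_first = momentum_update(delta, outputs, np.zeros((2, 3)))
dw_second = momentum_update(delta, outputs, dw_first)
print(dw_first)
print(dw_second)
```

The momentum term reuses a fraction of the previous step, so repeated steps in the same direction grow slightly while oscillating steps partly cancel.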
Validation of QSAR models
Common parameters used for checking the predictability of proposed models are the root mean square error (RMSE), the square of the correlation coefficient (R2), and the predictive residual error sum of squares (PRESS). These parameters were calculated for each model as follows:
\[ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \tag{3} \]
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{4} \]
\[ \mathrm{PRESS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{5} \]
where y_i is the true bioactivity of compound i, ŷ_i represents the calculated bioactivity of compound i, ȳ is the mean of the true activity in the studied set, and n is the total number of molecules in the studied set.
The value of R2 can usually be raised by adding independent variables to the model, even if the added variable does not reduce the unexplained variance of the dependent variable. Consequently, the use of the adjusted R2 is preferable:
\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \tag{6} \]
where n is the number of molecules in the studied data set and p is the number of independent variables in the model.
The real usefulness of QSAR models lies not just in their capability to reproduce known data, as confirmed by their fitting power (R2), but chiefly in their potential for predictive application. Hence, the QSAR models were developed by maximizing the explained variance in prediction, as measured by the leave-one-out cross-validated correlation coefficient, Q2.
The predictive ability of the regression model generated on the training-set molecules is also estimated from the predictions for the test-set compounds, by the external R2p defined as follows (37):
\[ R^2_p = 1 - \frac{\sum_{i=1}^{test} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{test} (y_i - \bar{y}_{tr})^2} \tag{7} \]
where ȳ_tr is the averaged value of the bioactivity for the training set and the summations cover all the molecules in the test set.
A technique widely used by researchers to guard their models against the danger of chance correlation between dependent and independent variables, i.e. fortuitous correlation without any predictive ability, is y-randomization. Y-randomization has been described as "probably the most powerful validation procedure" (39).
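The validation statistics named above can be sketched as small functions. The formulas used here are the conventional definitions (RMSE as the root of the mean squared residual, PRESS as the residual sum of squares, R2 as one minus PRESS over the total sum of squares, and the usual adjusted R2); they are assumed rather than copied from the paper, and the leave-one-out Q2 shown applies to a linear least-squares model only.

```python
import numpy as np

def press(y, y_hat):
    """Sum of squared residuals."""
    return float(np.sum((np.asarray(y, float) - np.asarray(y_hat, float)) ** 2))

def rmse(y, y_hat):
    """Root mean square error."""
    return float(np.sqrt(press(y, y_hat) / len(y)))

def r2(y, y_hat):
    """Fraction of the variance of y explained by y_hat."""
    y = np.asarray(y, float)
    return 1.0 - press(y, y_hat) / float(np.sum((y - y.mean()) ** 2))

def adjusted_r2(r2_value, n, p):
    """R2 penalized for the number of independent variables p."""
    return 1.0 - (1.0 - r2_value) * (n - 1) / (n - p - 1)

def r2_pred(y_test, y_hat_test, y_train_mean):
    """External R2p: test-set residuals relative to the training-set mean."""
    y_test = np.asarray(y_test, float)
    return 1.0 - press(y_test, y_hat_test) / float(np.sum((y_test - y_train_mean) ** 2))

def q2_loo(X, y):
    """Leave-one-out cross-validated Q2 for an MLR model with intercept."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    y_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        y_cv[i] = A[i] @ coef
    return r2(y, y_cv)

# Adjusted R2 for the reported GA-MLR fit (R2 = 0.778, n = 21, p = 3).
print(round(adjusted_r2(0.778, 21, 3), 3))
```

Under these definitions RMSE equals the square root of PRESS divided by n, which agrees with the test-set figures quoted later in the text: PRESS values of 0.58 and 0.07 for five compounds give RMSE values of about 0.34 and 0.12.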
Applicability domain of the model
Response outliers (i.e. molecules with standardized residuals greater than two standard deviation units) and compounds that strongly influence the figures of merit and statistical parameters of the model [i.e. molecules with a leverage value (h) greater than 3k´/n, where k´ is the number of model variables plus one and n is the number of molecules used in model development] were identified using the Williams plot (38).
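The two quantities plotted in a Williams plot can be computed as sketched below. The leverage is taken as the diagonal of the hat matrix built from the training descriptors with an intercept column, and the warning value follows the 3k´/n rule stated above; the function names, the intercept handling and the residual scaling are assumptions made for this illustration.

```python
import numpy as np

def leverages(X_train, X_query=None):
    """Leverage h_i = x_i (A^T A)^-1 x_i^T, with A the training design matrix.

    If X_query is given, leverages of those (e.g. test-set) compounds are
    returned relative to the training set.
    """
    X_train = np.asarray(X_train, dtype=float)
    A = np.column_stack([np.ones(len(X_train)), X_train])
    if X_query is None:
        Q = A
    else:
        X_query = np.asarray(X_query, dtype=float)
        Q = np.column_stack([np.ones(len(X_query)), X_query])
    core = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, core, Q)

def warning_leverage(n_descriptors, n_train):
    """Warning leverage h* = 3k'/n with k' = number of descriptors + 1."""
    return 3.0 * (n_descriptors + 1) / n_train

def standardized_residuals(y, y_hat, n_params):
    """Residuals divided by the residual standard deviation."""
    res = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    s = np.sqrt(np.sum(res ** 2) / (len(res) - n_params))
    return res / s

# Synthetic descriptors standing in for a 21-compound, 3-descriptor training set.
rng = np.random.default_rng(0)
h = leverages(rng.normal(size=(21, 3)))
print(np.round(h.sum(), 6), np.sum(h > warning_leverage(3, 21)))
```

Compounds to the right of the warning leverage are structurally influential, and compounds beyond ±2 standardized residual units are response outliers; the training leverages always sum to k´, which is a convenient sanity check.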
RESULTS
The structures of the 26 molecules were built and optimized, and a large number of descriptors (columns of the X block) were calculated for each molecule from its molecular structure. To relate the biological activities (dependent variable) to the molecular structures (independent variables), the logarithm of the inverse of the biological activity (log 1/IC50) of the 26 molecules was used. After the molecules had been divided into calibration and validation sets with the Kennard and Stone algorithm, different models were built using the training set. The developed models were then used to predict the activity of the test-set molecules in order to evaluate their performance.
To assess the homogeneity of the original data set and to recognize potential clusters among the studied molecules, principal component analysis (PCA) was performed within the calculated descriptor space for all of the molecules. PCA is a valuable multivariate statistical approach in which new orthogonal variables, called principal components (PCs), are derived as linear combinations of the original variables. These new variables are sorted by information content (i.e. explained variance of the original data set), so that most of the information is retained in the first few PCs. A key characteristic of PCA is that the PCs are uncorrelated. PCs can be used to obtain scores that present most of the variation in the original data set in a smaller number of dimensions.
The three most significant PCs (eigenvalues > 1) explain 77.57% of the variation in the data (56.74%, 12.74% and 8.09%, respectively). The distribution of the molecules over these first three principal components is shown in Fig. 1. As can be seen in this figure, no clusters exist in the data set.
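The PCA step can be sketched with a singular value decomposition. The sketch assumes the descriptors are autoscaled (mean-centered and divided by their standard deviation) before decomposition, which is consistent with the "eigenvalues > 1" criterion but is not stated in the text; the function name and the random data are illustrative.

```python
import numpy as np

def pca(X, n_components=3):
    """Return (scores, eigenvalues, explained variance ratios).

    Columns of X must not be constant, since each is divided by its
    standard deviation.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # autoscaling
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigvals = s ** 2 / (len(Z) - 1)                      # eigenvalues of the correlation matrix
    explained = eigvals / eigvals.sum()
    scores = U[:, :n_components] * s[:n_components]      # coordinates on the first PCs
    return scores, eigvals, explained

# Random stand-in for a 26-molecule descriptor matrix (not the paper's data).
rng = np.random.default_rng(0)
scores, eigvals, explained = pca(rng.normal(size=(26, 6)))
print(scores.shape, int(np.sum(eigvals > 1.0)))
```

Plotting the three score columns against each other gives the kind of view shown in Fig. 1, where separated groups of points would indicate clusters in the data set.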
Fig. 1
Principal components analysis of the calculated descriptors of all molecules in the data set.
After the homogeneity of the data set had been established, models were built using the training set. Before the model-building step, a pretreatment phase was carried out on the pool of calculated descriptors. Pretreatment began with the deletion of descriptors that were constant for all molecules. To reduce redundancy among the retained descriptors, if two or more descriptors were highly correlated, only the one with the highest correlation with the dependent variable was kept and the others were deleted. This pretreatment helps to accelerate descriptor selection and decreases the probability of including unrelated descriptors in the final model.
Dragon software was used to calculate four different classes of descriptors: constitutional, geometrical, topological, and functional group descriptors. The following procedure was employed to choose the most informative descriptors in each class using the training set. An MLR model was built with the calculated descriptors of each class, and the most significant molecular descriptors in the pool were identified by multiple linear regression analysis with a stepwise selection method. The MLR model developed for each class and its statistical parameters are reported in Table 3.
Table 3
Results of MLR analysis with different types of descriptors for the training-set molecules
As can be seen in this table, only 9 descriptors are enough to relate the bioactivity of the investigated molecules to their structures. Table 4 shows the selected descriptors, their definitions and classes. A number of the descriptors calculated for each molecule encode similar information about the molecule. Hence, it was desirable to examine the pool of calculated descriptors and eliminate those that are highly correlated with each other. The correlation coefficient (R2) matrix for the descriptors selected in the various MLR equations is shown in Table 5. Descriptors were regarded as correlated when R2 > 0.92.
Table 4
List of the descriptors selected for each class by genetic algorithm, with their definitions.
Table 5
Correlation coefficient (R2) matrix for the descriptors selected by MLR in various classes.
The most significant molecular descriptors among the selected descriptors were identified using a genetic algorithm (GA) selection method. The descriptors selected by GA were then used as inputs for multiple linear regression analysis. The best equation obtained for the pIC50 of the 2-mercaptoimidazole derivatives was:
\[ \mathrm{pIC_{50}} = 3.847\,(\pm 2.165) + 9.562\,(\pm 2.773) \times \mathrm{JhetZ} + 0.062\,(\pm 0.021) \times \mathrm{G(O..O)} + 10.894\,(\pm 4.079) \times \mathrm{J} \tag{8} \]
n = 21, R2 = 0.778, F = 18.741
To evaluate the predictability of the GA-MLR model, the optimized model was applied to predict the pIC50 values of all compounds in the calibration and prediction sets. The calculated pIC50 for each molecule is given in Table 6.
Table 6
The experimental pIC50 and the predicted values of the studied molecules
Positive regression coefficients indicate that the descriptor contributes positively to the value of pIC50, whereas negative coefficients indicate that an increase in the value of the descriptor leads to a decrease in pIC50. In other words, increasing JhetZ, G(O..O) and J will increase the pIC50 of the investigated 2-mercaptoimidazole derivatives. The standardized regression coefficient reveals the significance of an individual descriptor in the regression model: the larger the absolute value of the coefficient, the greater the weight of the variable in the model. The effects of the descriptors on the biological activity are shown in Fig. 2. As can be seen, the effects of the two topological indices JhetZ and J are more significant than that of the other descriptor in the final GA-MLR model. Experimental versus predicted pIC50 values obtained by GA-MLR modeling are shown graphically in Fig. 3A.
Fig. 2
Standardized coefficients of the descriptors appearing in the GA-MLR model.
Fig. 3
Plots of predicted activities versus experimental activities for (A) GA-MLR, and (B) GA-ANN.
The statistical parameters calculated for the developed MLR model are presented in Table 7. The R2, Q2 and RMSE for the prediction set are 0.78, 0.89 and 0.36, respectively. The chemical applicability domain of the GA-MLR model and the reliability of its predictions were also assessed by the leverage method. Leverage values can be calculated for both training and test compounds. Calculating leverage for the training set is useful for identifying compounds that influence the model so strongly that they make it unstable. Calculating leverage for objects that were not used in model building (such as the test set) is useful for defining the applicability domain of the model.
Table 7
Statistical parameters obtained for the developed models for the CCR2 inhibitory activity of the investigated compounds.
In the Williams plot, chemicals that are influential on the structural domain of the model, i.e. those with a hat value exceeding the threshold (vertical line in Fig. 4A and 4B), can be regarded as compounds with unusual structural characteristics that are poorly represented in the training set, and they may influence descriptor selection towards a better modeling of those chemicals. Outliers are compounds whose standardized residuals exceed the threshold value (here ±2σ, horizontal dashed lines in Fig. 4A and 4B); they may be associated with errors in the measured bioactivities. Inspection of the Williams plot (Fig. 4A) shows that there is no response outlier in the investigated data set. Two molecules, 13 and 15, have leverage values higher than the warning leverage limit (0.476), but their standardized residuals lie within ±2.0 standard deviation units. Hence these molecules can be considered influential on the fitting performance of the model, but there is no strong reason to treat them as outliers and delete them from the data set. The Williams plot thus provides further support for the reliability of the predictions.
Fig. 4
Plot of standardized residuals versus leverage values (Williams plot) for (A) GA-MLR and (B) GA-ANN. The compounds in the training and test sets are denoted differently; the response outliers and structurally influential compounds discussed in the text are labeled with numbers. The horizontal lines are the 2.0σ limits and the vertical line is the warning value of leverage (h* = 0.470).
Fig. 5
Mesh contour plots of the output of GA-ANN (on the basis of RMSECV) used to optimize the network parameters, including learning rate, momentum, and number of hidden layer nodes (nH): (A) nH = 2; (B) nH = 3; (C) nH = 4; (D) nH = 5; (E) nH = 6; and (F) nH = 7.
In the nonlinear model, a fully connected, three-layer, feed-forward ANN trained with the back-propagation learning algorithm was used. The GA-ANN had an input layer with one neuron per selected descriptor (3 neurons), a hidden layer whose number of neurons and transfer function had to be optimized, and a single-neuron output layer corresponding to the activity vector, whose elements are the bioactivities of the studied molecules calculated by the network. There are no exact theoretical principles for choosing the appropriate network topology, so before the network was trained, adjustable parameters such as the number of nodes in the hidden layer, the transfer function and the learning rate were optimized. The root mean square error of cross validation (RMSECV) was used to evaluate the ANN. The values resulting from the hidden layer are transferred to the last layer, which contains a single neuron representing the predicted activity. A linear transfer function was chosen for the output layer, and a sigmoid transfer function, being more flexible, was selected for the hidden layer.
To study the effect of the network parameters on model performance, ANN configurations with different numbers of hidden-layer neurons (nH = 2, 3, 4, 5, 6 and 7), learning rates (from 0.1 to 1) and momentum values (from 0.1 to 1) were built, and the output of each network was evaluated on the basis of RMSECV. Response mesh plots were used to optimize the number of hidden-layer nodes, the learning rate and the momentum. Fig. 5 shows the mesh plots of the model output (RMSECV) as a function of learning rate and momentum for six different numbers of hidden-layer nodes. To prevent overfitting of the ANN model, training must be stopped when the RMSECV of the activities calculated by the network is at its minimum.
The results show that 5 nodes in the hidden layer, a learning rate of 0.6 and a momentum of 0.1 are the optimum parameters of the model. After these parameters had been optimized, the number of iterations was optimized. Fig. 6 shows a plot of the RMSECV for the training set versus the number of epochs, which indicates the appropriate extent of training. While the network is trained on the training set, RMSECV initially decreases and then begins to increase after approximately 1900 epochs. This point marks the onset of overtraining, so 1900 iterations was chosen as the optimum number of epochs. The nonlinear model was then trained on the training set to optimize the weights and biases. To estimate the predictability of the ANN, the trained network was applied to predict the pIC50 values of the test set, which were not used in the modeling step. The activities predicted by GA-ANN are plotted against the experimental values in Fig. 3B and reported in Table 6. The calculated values are in good agreement with the experimental values.
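The network architecture described above (descriptor inputs, one sigmoid hidden layer, one linear output neuron, back-propagation with momentum) can be sketched in NumPy as follows. This is an illustrative re-implementation, not the authors' MATLAB code: it uses full-batch updates and random initial weights, both of which are assumptions, and its default learning rate of 0.1 is chosen conservatively for this sketch rather than taken from the paper, which reports 0.6 as its optimum. The hidden-layer size, momentum and epoch count default to the values reported in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ann(X, y, n_hidden=5, eta=0.1, alpha=0.1, n_epochs=1900, seed=0):
    """Train a p-n_hidden-1 network; return (params, RMSE history)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # params = [W1 (p x h), b1 (h), W2 (h), b2 (1)]
    params = [rng.normal(0.0, 0.5, (p, n_hidden)), np.zeros(n_hidden),
              rng.normal(0.0, 0.5, n_hidden), np.zeros(1)]
    steps = [np.zeros_like(w) for w in params]
    history = []
    for _ in range(n_epochs):
        W1, b1, W2, b2 = params
        H = sigmoid(X @ W1 + b1)                 # hidden layer, sigmoid transfer
        err = y - (H @ W2 + b2[0])               # linear output neuron
        history.append(float(np.sqrt(np.mean(err ** 2))))
        # Back-propagate the output error to the hidden layer.
        d_hid = np.outer(err, W2) * H * (1.0 - H)
        directions = [X.T @ d_hid / n, d_hid.mean(axis=0),
                      H.T @ err / n, np.array([err.mean()])]
        # Step = learning rate * error term + momentum * previous step.
        steps = [eta * g + alpha * s for g, s in zip(directions, steps)]
        params = [w + s for w, s in zip(params, steps)]
    return params, history

def predict_ann(params, X):
    W1, b1, W2, b2 = params
    return sigmoid(X @ W1 + b1) @ W2 + b2[0]

# Synthetic nonlinear data with 21 samples and 3 inputs (not the paper's data).
rng = np.random.default_rng(1)
X_demo = rng.random((21, 3))
y_demo = 2.0 + np.sin(3.0 * X_demo[:, 0]) + X_demo[:, 1] * X_demo[:, 2]
params, history = train_ann(X_demo, y_demo, n_epochs=500)
print(round(history[0], 3), round(history[-1], 3))
```

To reproduce the kind of optimization shown in Fig. 5 and Fig. 6, this training routine would be wrapped in a cross-validation loop over the number of hidden nodes, the learning rate and the momentum, recording the cross-validated RMSE at each epoch and stopping where it is lowest.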
Fig. 6
Plot of RMSECV for training set versus the number of iterations
Table 7 compares the results obtained using the GA-MLR and GA-ANN models. The R2, RMSE and PRESS of the models for the training and test sets reveal the potential of the ANN model for predicting the pIC50 values of various 2-mercaptoimidazoles as CCR2 inhibitors. The RMSE and PRESS of 0.34 and 0.58 obtained for the test set with the GA-MLR model should be compared with the values of 0.12 and 0.07 obtained with the GA-ANN model. Table 7 shows that, although the descriptors appearing in the GA-MLR model are used as inputs for the GA-ANN model, the statistics improve considerably. This improvement arises because the pIC50 values of the 2-mercaptoimidazoles are nonlinearly correlated with the descriptors selected by the genetic algorithm.
As for GA-MLR, a Williams plot was constructed for the GA-ANN model to check for outliers and/or molecules with a high influence on the results (Fig. 4B). Leverage values and standardized residuals of the predicted activities are plotted on the x and y axes, respectively. Reference lines are drawn for the critical leverage value (0.470) and for the critical standardized residual (±2σ). Molecules with leverage greater than the critical value can be considered objects with too much influence on the regression model. In the same manner, molecules with a standardized residual greater than the critical value are poorly predicted. The Williams plot (Fig. 4B) shows that none of the molecules in the set is a response outlier on the basis of the 2σ criterion. On the other hand, on the basis of the leverage approach, two compounds are recognized as structurally influential chemicals: molecules 13 and 15. These results are the same as those for the GA-MLR model.
To rule out chance correlations, which are possible because of the large number of generated columns (independent variables), and to examine the robustness of the models, the Y-randomization test was applied. The dependent variable vector is randomly permuted and a new QSAR model is constructed using the original independent variable matrix. The new models are expected to have low R2 and Q2 values. To be certain, the procedure was repeated several times. If the randomized models show high R2 and Q2, an acceptable QSAR model cannot be obtained for the data set with the modeling method used. Several random shuffles of the Y vector were performed and the results are shown in Table 8. The low R2 and Q2 values show that the good results of the original models are not due to chance correlation or structural dependence of the training set.
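The Y-randomization test can be sketched for the linear model as follows: the activity vector is shuffled, the model is refitted on the unchanged descriptor matrix, and the resulting R2 values are compared with that of the original model. The function names, the number of shuffles and the synthetic data are assumptions made for illustration; the paper does not state how many shuffles were used.

```python
import numpy as np

def r2_fit(X, y):
    """R2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - float(residuals @ residuals) / float(np.sum((y - y.mean()) ** 2))

def y_randomization(X, y, n_shuffles=20, seed=0):
    """R2 values of models refitted on randomly permuted activities."""
    rng = np.random.default_rng(seed)
    return np.array([r2_fit(X, rng.permutation(y)) for _ in range(n_shuffles)])

# Synthetic data with a genuine linear signal (not the paper's data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(21, 3))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + 0.2 * rng.normal(size=21)
shuffled = y_randomization(X_demo, y_demo)
print(round(r2_fit(X_demo, y_demo), 3), round(float(shuffled.mean()), 3))
```

A genuine structure-activity relationship shows up as a large gap between the original R2 and the R2 values of the shuffled models; the same loop can be repeated with Q2 in place of R2.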
Table 8
R2 and Q2 obtained in two models by Y randomization.
DISCUSSION
Interpreting the descriptors included in the model can provide valuable chemical insight into the biological activity. For this reason, a brief explanation of the three descriptors in the GA-MLR model is given below. JhetZ is the Balaban-type index from the Z-weighted distance matrix and J is the Balaban J index. Both belong to the topological descriptor class. The presence of these descriptors in the final MLR model essentially accounts for size, shape and branching, and thus for the steric contribution to biological activity. The structures of almost all the 2-mercaptoimidazoles included in this QSAR study are very similar to each other: they have an imidazole ring at the center of the molecule and three substituents, at positions 1, 2 and 3, that have more or less similar structures. The appearance of topological descriptors such as J and JhetZ in the model is therefore not unusual. These topological descriptors encode the compactness and the degree of branching of a molecule.
G(O..O) is the sum of the geometrical distances between oxygen atoms in the studied molecules. Because this descriptor belongs to the geometrical class, geometrical properties such as bond angles, dihedral angles and interatomic distances are probably important for the effectiveness of these compounds as CCR2 inhibitors. The variables in the GA-MLR model thus encode different aspects of topological and geometrical molecular structure.
One of the main aims of this study was to compare the ability of linear (MLR) and nonlinear (ANN) QSAR model-building methods to predict the inhibitory activity of 2-mercaptoimidazoles as CCR2 inhibitors. To obtain robust and accurate models, ANNs should be trained with a subset of descriptors rather than with all generated descriptors. As discussed above, a genetic algorithm was applied as the feature selection method to choose the most relevant subset of descriptors, and a layered feed-forward back-propagation neural network was then developed using the three descriptors appearing in the MLR model as its inputs. Since these descriptors were selected by GA, the QSAR model was called GA-ANN.
As a result, it was found that a correctly selected and trained neural network could reasonably represent the dependence of the CCR2 inhibitory activities of the 2-mercaptoimidazoles on the descriptors. The optimized neural network could then simulate the complex nonlinear relationship between the pIC50 values and the descriptors.
CONCLUSION
QSAR models were built for the CCR2 inhibitory activity of a series of 2-mercaptoimidazoles using the GA-MLR and GA-ANN methods. Comparison of the two models reveals the superiority of the GA-ANN model over the GA-MLR model. Because the improvement obtained with the non-linear model is substantial, it can be deduced that there is a non-linear relationship between the pIC50 values and the calculated structural descriptors of the 2-mercaptoimidazoles. Topological descriptors (JhetZ and J) are of considerable importance in the final models. The presence of these descriptors essentially accounts for size, shape and branching, and thus for the steric contribution to biological activity.