Shuffling multivariate adaptive regression splines and adaptive neuro-fuzzy inference system as tools for QSAR study of SARS inhibitors.

M Jalali-Heravi, M Asadollahi-Baboli, A Mani-Varnosfaderani.

Abstract

In this work, the inhibitory activity of pyridine N-oxide derivatives against human severe acute respiratory syndrome (SARS) is predicted using quantitative structure-activity relationship (QSAR) models. The models were developed with multivariate adaptive regression splines (MARS) and an adaptive neuro-fuzzy inference system (ANFIS), combined with a shuffling cross-validation technique. A shuffling MARS algorithm was used to select the most important variables, which then served as inputs to ANFIS for predicting the SARS inhibitory activities of the pyridine N-oxide derivatives. A data set of 119 drug-like compounds was coded with over one hundred calculated molecular descriptors. The descriptors that best describe the inhibition mechanism were the solvation connectivity index, length to breadth ratio, relative negative charge, harmonic oscillator of aromatic index, average molecular weight and total path count. These parameters belong to the topological, electronic, geometric, constitutional and aromaticity descriptor classes. The R2 and root mean square error (RMSE) are 0.884 and 0.359, respectively. The accuracy and robustness of the shuffling MARS-ANFIS model in predicting the inhibitory activity (pIC50) of pyridine N-oxide derivatives were assessed using leave-one-out and leave-multiple-out cross-validation and by Y-randomization. Comparison with GA-PLS-ANFIS shows that the shuffling MARS-ANFIS model performs better and can be considered as a tool for predicting the inhibitory behavior of SARS drug-like molecules.

Year:  2009        PMID: 19665859      PMCID: PMC7126869          DOI: 10.1016/j.jpba.2009.07.009

Source DB:  PubMed          Journal:  J Pharm Biomed Anal        ISSN: 0731-7085            Impact factor:   3.935


Introduction

The discovery of a novel human coronavirus (H-CoV) as the cause of the newly recognized severe acute respiratory syndrome (SARS) poses a new challenge to the medical community in keeping this disease under control [1]. Human coronaviruses cause up to 30% of colds and sometimes cause lower respiratory tract disease. In contrast, animal coronaviruses are known to cause devastating epizootics of respiratory or enteric diseases in livestock and poultry [2]. The SARS coronavirus is clearly new to the human population, and its RNA genome differs substantially from the sequences of all known coronaviruses. Because SARS has a high rate of transmission, it calls for a rapid, sensitive and inexpensive treatment method that can effectively prevent the rapid spread of the infection. It is therefore prudent to develop safe and effective drugs against SARS-CoV as quickly as possible in case a new widespread outbreak occurs. The development of effective drugs against SARS-CoV may also provide new strategies for the prevention or treatment of other coronavirus diseases in animals or humans [3]. SARS inhibitors have potential therapeutic value and have been extensively studied in the pharmaceutical industry [4]. Recently, a total of 119 compounds, all belonging to the class of pyridine N-oxide derivatives and showing good inhibitory concentrations against SARS-CoV, have been reported [5]. To find and design new compounds with enhanced inhibitory activity, a systematic study of the effect of different substituents on the activity of the analogues is needed. At the same time, the growth of computational techniques has accelerated the drug design process, and many databases of inhibitors exist that have yet to be evaluated against SARS. Quantitative structure–activity relationship (QSAR) modeling has been demonstrated to be a capable tool for investigating the bioactivity of various classes of compounds [6], [7], [8], [9], [10], [11], [12].
Experimental evaluation of the inhibitory activity of newly designed compounds is time-consuming and expensive; it is therefore of interest to develop a method for predicting biological activity before synthesis. QSAR seeks information relating chemical structure to biological activity by developing a mathematical model. Building a QSAR model begins with calculating theoretical parameters or selecting structural features for the compounds involved. Nowadays, hundreds of descriptors can be generated in QSAR studies, but only some of them are statistically significant in terms of correlation with biological activity for a particular analysis. Variable selection techniques have therefore become important for producing a useful predictive model. A suitable feature selection method ensures the stability of the model and the consistency of the relationship between the descriptors and biological activity [13]. To make sure that the most important descriptors were selected, a shuffling cross-validation technique was used in this work. In this method, the data set is divided into several subsets, and the variable selection process is performed for different combinations of these subsets. The descriptors that appear most frequently in the resulting models are then selected as the most important variables describing the inhibitory effect. In this study, multivariate adaptive regression splines (MARS) combined with shuffling cross-validation (SCV) were employed to select the most important parameters describing the activity of SARS inhibitors. The selected descriptors were then used as inputs to an adaptive neuro-fuzzy inference system (ANFIS), and a hybrid model called shuffling MARS–ANFIS was developed. As a final step, the generated model was used to predict the activity of pyridine N-oxide derivatives as SARS inhibitors.

Computational methods

Multivariate adaptive regression splines

Multivariate adaptive regression splines (MARS) is a non-parametric regression method proposed by Friedman in 1991 [14], [15]. MARS is now used for analyzing biological, economic, sociological and other databases [16]. The main idea that distinguishes MARS from other methods is that it divides the range of each independent variable into sub-regions and defines a different mathematical equation for each of them. Each equation relates one sub-region of the independent variable to the response of the system separately. This framework makes MARS useful for modeling non-linear and complicated systems, and also applicable when the behavior of the system is strongly affected by only a specific region of an independent variable. In general, a regression pair can be written as $(X, Y)$, where $X$ represents one or $n$ independent variable(s) and $Y$ is the dependent variable. In the MARS model, every independent variable has one or more split points, named $t$. For $X$ greater than $t$ there is one equation, the right-side basis function (BF), and for $X$ less than $t$ there is another, the left-side basis function. These two basis functions (spline functions) relate $X$ to the dependent variable $Y$. Their mathematical representation is:

$$[+(X - t)]_+^q = \begin{cases} (X - t)^q, & X > t \\ 0, & \text{otherwise} \end{cases}$$

$$[-(X - t)]_+^q = \begin{cases} (t - X)^q, & X < t \\ 0, & \text{otherwise} \end{cases}$$

where $q$ ($\geq 0$) is the power to which the splines are raised, which determines the degree of smoothness of the resulting function estimate. The final response in MARS is calculated by summing all $M$ basis functions with suitable coefficients $c_m$:

$$\hat{Y} = c_0 + \sum_{m=1}^{M} c_m B_m(X)$$

where $\hat{Y}$ is the dependent variable predicted by the MARS model, $c_0$ is a constant and $B_m(X)$ is the $m$th basis function. To determine which basis functions should be included in the model, MARS uses generalized cross-validation (GCV). The GCV is the mean squared residual error divided by a penalty that depends on the model complexity. It is defined as:

$$\mathrm{GCV}(M) = \frac{\frac{1}{N} \sum_{i=1}^{N} \left[ Y_i - \hat{Y}_M(X_i) \right]^2}{\left[ 1 - C(M)/N \right]^2}$$

where $N$ is the number of observations and $C(M)$ is the complexity penalty, which increases with the number of basis functions in the model and is defined in Eq. (5):

$$C(M) = (M + 1) + dM \qquad (5)$$

where $M$ is the number of basis functions and the parameter $d$ is a penalty for each basis function included in the model. A large value of $d$ leads to fewer basis functions and therefore smoother function estimates. The theory behind multivariate adaptive regression splines has been described in detail elsewhere [17].

In this study, the data set containing $p$ observations was divided into $(p - k)$ calibration and $k$ validation objects. The root mean square error of the validation set (RMSEv) was used as the fitness function for the search algorithm. Because there are many ways of selecting $k$ samples out of $p$, various models can be built using this external validation strategy. The parameters of the generated models depend on the training and validation sets; as a result, different variables and split points are expected to be determined in each case. The variables that appeared most frequently in the built models were used as inputs for the final modeling.
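As an illustration of the right and left basis functions and of the GCV criterion, the following is a minimal Python sketch, not the authors' MATLAB implementation. The function names, the toy data, the knot at t = 4 and the default penalty d = 3 are all assumptions made for this example, and the complexity term uses the common form (M + 1) + dM.

```python
import numpy as np

def right_basis(x, t, q=1):
    """Right-side basis function: (x - t)^q for x > t, otherwise 0."""
    d = np.maximum(np.asarray(x, dtype=float) - t, 0.0)
    return np.where(d > 0, d ** q, 0.0)

def left_basis(x, t, q=1):
    """Left-side basis function: (t - x)^q for x < t, otherwise 0."""
    d = np.maximum(t - np.asarray(x, dtype=float), 0.0)
    return np.where(d > 0, d ** q, 0.0)

def gcv(y, y_hat, n_basis, d=3.0):
    """Mean squared residual divided by a complexity penalty.

    The penalty d per basis function is an assumed value for this sketch.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    complexity = (n_basis + 1) + d * n_basis
    mse = np.mean((y - y_hat) ** 2)
    return mse / (1.0 - complexity / n) ** 2

# Toy data (hypothetical): a piecewise-linear response with one split point at t = 4.
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * right_basis(x, 4.0) + 0.5 * left_basis(x, 4.0)

# Design matrix: constant term plus the two basis functions; coefficients by least squares.
B = np.column_stack([np.ones_like(x), right_basis(x, 4.0), left_basis(x, 4.0)])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
score = gcv(y, B @ coef, n_basis=2)
```

In a full MARS search the split points and variables are chosen by forward selection and backward deletion; here the knot is fixed only to keep the example short.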

Shuffling cross-validation

In this technique, the data set is divided into several subsets, and variable selection and model development are performed for all combinations of the subsets. The descriptors that appear most frequently in the developed models are then selected as the most important variables for describing the variation in inhibitor activity. In the present work, the data set was randomly divided into six subsets (A–F). For variable selection, four subsets were used as the calibration set and the two remaining subsets as the validation set for evaluating the selected parameters. There are fifteen ways of selecting four distinct subsets out of six, so fifteen MARS models can be developed with different calibration and validation sets. The molecules included in subsets A–F are shown in Table 1. All fifteen combinations of calibration and validation subsets were used in the present study to develop the MARS models. The use of the shuffling MARS technique helps to ensure that the developed model is robust and reliable and is not obtained by chance.
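The enumeration of calibration and validation combinations, and the counting of how often each descriptor is selected, can be sketched in Python as follows. This is only an illustration: the helper `most_frequent`, its `min_runs` threshold and the toy descriptor names are hypothetical, and the per-run selection would come from a MARS fit that is not shown here.

```python
from collections import Counter
from itertools import combinations

subsets = ["A", "B", "C", "D", "E", "F"]

# Every choice of four calibration subsets leaves two subsets for validation.
runs = []
for calibration in combinations(subsets, 4):
    validation = tuple(s for s in subsets if s not in calibration)
    runs.append((calibration, validation))

def most_frequent(selected_per_run, min_runs):
    """Return the descriptors selected in at least `min_runs` of the runs."""
    counts = Counter(d for selected in selected_per_run for d in set(selected))
    return sorted(d for d, c in counts.items() if c >= min_runs)

# Hypothetical selections from three runs, only to show the counting step.
toy_selection = [["x1", "x2"], ["x1"], ["x1", "x3"]]
frequent = most_frequent(toy_selection, min_runs=2)
```

The loop produces 15 calibration/validation pairs, which matches the number of combinations of four subsets out of six.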
Table 1

Experimental and calculated inhibitor data using shuffling MARS–ANFIS model for pyridine N-oxide derivatives.

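No. | Subset (a) | R (b) | X1 | X2 | X3 | X4 | X5 | Z1, Z2 | Y1 | Y2 | Y3 | Y4 | Exp. pIC50 | MARS–ANFIS pIC50
1 | A | H | H | H | H | H | H | O, O | H | H | H | H | 3.840 | 3.866
2 | C | H | Me | H | H | H | H | O, O | H | H | H | H | 4.060 | 4.272
3 | B | H | H | Me | H | H | H | O, O | H | H | H | H | 4.059 | 4.327
4 | D | H | H | Me | H | H | H | O, O | H | H | H | H | 3.825 | 3.952
5 | A | H | H | H | Me | H | H | O, O | H | H | H | H | 4.040 | 4.232
6 | B | H | H | H | Me | H | H | O | H | H | H | H | 4.352 | 4.845
7 | E | H | Me | H | Me | H | H | O, O | H | H | H | H | 4.239 | 4.082
8 | D | H | Me | H | H | Me | H | O, O | H | H | H | H | 3.938 | 4.359
9 | C | H | Me | H | H | Me | H | O | H | H | H | H | 3.736 | 3.325
10 | F | H | Me | Me | H | H | Me | O, O | H | H | H | H | 4.423 | 4.721
11 | A | H | Me | Me | H | H | Me | O | H | H | H | H | 5.294 | 5.128
12 | E | H | H | H | Et | H | H | O, O | H | H | H | H | 4.361 | 4.150
13 | D | H | H | H | iProp | H | H | O, O | H | H | H | H | 4.239 | 4.633
14 | C | H | iProp | H | H | iProp | H | O, O | H | H | H | H | 4.491 | 4.699
15 | B | H | iProp | H | H | iProp | H | O | H | H | H | H | 4.717 | 4.327
16 | E | H | H | H | tBut | H | H | O, O | H | H | H | H | 4.502 | 4.663
17 | D | H | H | H | tPent | H | H | O, O | H | H | H | H | 4.652 | 4.626
18 | A | H | H | H | OMe | H | H | O, O | H | H | H | H | 4.756 | 4.874
19 | F | H | H | H | OMe | H | H | O | H | H | H | H | 3.955 | 3.910
20 | C | H | OMe | H | H | OMe | H | O, O | H | H | H | H | 5.098 | 5.199
21 | E | H | H | OMe | OMe | H | H | O, O | H | H | H | H | 3.712 | 3.738
22 | B | H | H | OMe | OMe | H | H | O | H | H | H | H | 3.878 | 4.083
23 | C | H | H | OMe | OMe | OMe | H | O, O | H | H | H | H | 5.321 | 5.680
24 | F | H | H | OMe | OMe | OMe | H | O | H | H | H | H | 3.823 | 3.498
25 | E | H | OMe | H | H | Me | H | O, O | H | H | H | H | 3.811 | 4.002
26 | A | H | OMe | H | H | Me | H |  | H | H | H | H | 3.834 | 3.752
27 | E | H | OEt | H | H | H | H | O, O | H | H | H | H | 4.037 | 3.952
28 | D | H | OEt | H | H | H | H | O | H | H | H | H | 3.899 | 3.905
29 | C | H | H | F | H | H | H | O, O | H | H | H | H | 3.651 | 3.250
30 | B | H | H | H | F | H | H | O | H | H | H | H | 4.148 | 4.250
31 | F | H | Cl | H | H | H | H | O, O | H | H | H | H | 3.461 | 3.138
32 | A | H | Cl | H | Cl | H | H | O, O | H | H | H | H | 4.174 | 4.007
33 | E | H | Cl | H | H | H | Cl | O, O | H | H | H | H | 4.710 | 4.872
34 | B | H | H | Cl | Cl | H | H | O, O | H | H | H | H | 4.327 | 4.140
35 | D | H | Cl | Cl | H | H | Cl | O, O | H | H | H | H | 5.040 | 4.892
36 | F | H | Cl | Cl | Cl | Cl | Cl | O, O | H | H | H | H | 4.734 | 4.809
37 | A | H | Cl | Cl | Me | Cl | Cl | O, O | H | H | H | H | 5.546 | 5.770
38 | E | H | Cl | H | NO2 | H | H | O, O | H | H | H | H | 5.302 | 5.119
39 | A | H | H | Br | H | H | H | O, O | H | H | H | H | 4.732 | 4.567
40 | C | H | Br | H | H | OMe | H | O, O | H | H | H | H | 4.647 | 4.170
41 | F | H | iProp | H | Br | iProp | H | O, O | H | H | H | H | 4.378 | 4.190
42 | B | H | I | H | H | H | H | O, O | H | H | H | H | 4.972 | 4.892
43 | F | H | NO2 | H | H | H | H | O, O | H | H | H | H | 4.176 | 4.103
44 | C | H | H | H | NO2 | H | H | O, O | H | H | H | H | 4.355 | 4.349
45 | E | H | H | NO2 | H | NO2 | H | O, O | H | H | H | H | 4.588 | 4.404
46 | B | H | H | NO2 | Me | H | H | O, O | H | H | H | H | 4.887 | 4.438
47 | F | H | H | Me | NO2 | H | H | O, O | H | H | H | H | 4.726 | 4.421
48 | C | H | Me | H | H | NO2 | H | O, O | H | H | H | H | 4.799 | 4.849
49 | A | H | OMe | H | H | NO2 | H | O, O | H | H | H | H | 3.733 | 3.717
50 | D | H | H | NO2 | Cl | H | H | O, O | H | H | H | H | 4.281 | 4.439
51 | F | H | CN | H | H | H | H | O, O | H | H | H | H | 4.883 | 4.672
52 | B | H | H | H | CN | H | H | O, O | H | H | H | H | 4.262 | 4.069
53 | E | H | H | H | Phe | H | H | O, O | H | H | H | H | 4.359 | 4.188
54 | D | H | OPhe | H | H | H | H | O, O | H | H | H | H | 4.850 | 4.615
55 | A | H | H | OMe | OBz | H | H | O, O | H | H | H | H | 4.533 | 4.783
56 | B | H | H | CF3 | H | H | H | O | H | H | H | H | 4.667 | 4.873
57 | C | H | OH | H | H | NO2 | H |  | H | H | H | H | 3.747 | 3.250
58 | F | Me | H | H | H | H | H | O | H | H | H | H | 3.876 | 3.994
59 | B | Me | H | H | Me | H | H | O | H | H | H | H | 5.420 | 5.237
60 | C | Me | Me | H | H | Me | H |  | H | H | H | H | 5.646 | 5.860
61 | D | Me | H | H | F | H | H | O | H | H | H | H | 3.545 | 3.407
62 | A | Me | Cl | H | H | Me | H | O | H | H | H | H | 5.905 | 5.390
63 | E | Et | H | H | H | H | H | O, O | H | H | H | H | 6.192 | 5.717
64 | C | Et | Me | H | H | Me | H | O, O | H | H | H | H | 3.658 | 3.525
65 | F | Prop | H | H | H | H | H | O, O | H | H | H | H | 3.582 | 3.633
66 | B | Prop | H | H | H | H | H |  | H | H | H | H | 3.831 | 3.934
67 | D | Prop | Me | H | H | Me | H |  | H | H | H | H | 3.622 | 3.473
68 | A | Hept | H | Me | Me | Me | H |  | H | H | H | H | 4.495 | 4.380
69 | F | Hept | Me | H | H | Me | H | O, O | H | H | H | H | 4.252 | 4.272
70 | D | Undec | Me | H | H | Me | H | O, O | H | H | H | H | 5.043 | 4.767
71 | C | Isobut | Me | H | H | Me | H | O, O | H | H | H | H | 3.691 | 3.399
72 | E | C3H6 | Me | H | H | Me | H | O, O | H | H | H | H | 4.731 | 4.549
73 | B | C6H5 | Me | H | H | H | H | O, O | H | H | H | H | 3.922 | 3.603
74 | C | C6H5 | Me | H | H | Me | H | O, O | H | H | H | H | 3.859 | 3.781
75 | D | C6H5 | Me | H | H | Me | H |  | H | H | H | H | 3.915 | 3.749
76 | F | CH2Ph | Me | H | H | Me | H |  | H | H | H | H | 3.976 | 4.003
77 | A | CN | Me | H | H | Me | H |  | H | H | H | H | 5.111 | 5.216
78 | D | CH2CO2H | Me | H | H | Me | H | O, O | H | H | H | H | 3.900 | 3.617
79 | F | Br | Me | H | H | Me | H | O, O | H | H | H | H | 3.567 | 3.598
80 | B | CO2CH3 | Me | H | H | H | H | O | H | H | H | H | 3.643 | 3.576
81 | C | CO2CH3 | Me | H | H | Me | H | O, O | H | H | H | H | 3.769 | 3.392
82 | D | CO2CH3 | H | OPh | H | H | H | O, O | H | H | H | H | 3.969 | 3.860
83 | F | CF3 | Me | H | H | Me | H | O, O | H | H | H | H | 4.186 | 4.199
84 | A | CH2OMe | Me | H | H | Me | H | O, O | H | H | H | H | 3.926 | 3.480
85 | E | Me, Cl | H | H | H | H | H | O, O | H | H | H | H | 4.143 | 3.895
86 | C | Me, Cl | Me | H | H | Me | H | O, O | H | H | H | H | 4.474 | 4.177
87 | B | Me | H | H | Me | H | Me | O, O | H | H | Me | H | 4.151 | 3.881
88 | D | H | H | H | H | H | H | O, O | H | H | H | Me | 4.087 | 4.356
89 | F | Me | H | H | Me | H | Me | O, O | H | H | H | Me | 3.973 | 3.970
90 | A | Me | H | H | Me | H | H | O | H | H | H | Me | 3.905 | 3.804
91 | E | Me | H | H | H | H | H | O, O | H | H | H | Me | 3.817 | 3.616
92 | D | Me | H | Me | H | H | H | O, O | H | H | H | Me | 4.213 | 4.120
93 | B | Me | H | H | Me | H | Et | O, O | H | H | H | Me | 4.423 | 4.289
94 | F | Cl | H | H | H | Cl | H |  | H | H | H | Me | 4.203 | 4.127
95 | C | H | H | H | H | H | Me | O, O | H | H | H | Me | 3.986 | 3.957
96 | E | Cl | H | H | H | H | H | O, O | H | H | H | Me | 3.665 | 3.634
97 | A | Me | NO2 | H | H | H | H | O, O | H | H | H | Me | 4.152 | 4.113
98 | B | Me | H | Me | H | H | Me | O, O | H | H | H | Me | 4.526 | 4.321
99 | F | Cl | H | H | H | H | Me | O, O | H | H | H | Me | 3.832 | 3.577
100 | D | Me | NO2 | H | H | H | Me | O, O | H | H | H | Me | 3.914 | 3.719
101 | E | Me | H | H | Me | H | H | O, O | H | H | H | OMe | 3.545 | 3.801
102 | A | Me | H | H | Me | H | H | O | H | H | H | OMe | 4.585 | 4.150
103 | C | Me | H | H | Me | H | H |  | H | H | H | OMe | 3.672 | 3.983
104 | E | Me | H | H | Me | H | H |  | H | H | H | OH | 3.595 | 3.088
105 | B | H | H | OMe | H | H | H | O, O | H | H | t-Bu | H | 3.604 | 3.629
106 | D | H | H | OMe | H | H | H |  | H | H | t-Bu | H | 3.946 | 4.304
107 | E | H | H | H | H | H | H |  | Cl | H | H | H | 3.801 | 3.871
108 | A | Me | H | H | Me | H | H | O, O | Cl | H | H | H | 3.720 | 3.324
109 | B | Me | H | H | Me | H | Me | O, O | Cl | H | H | H | 4.892 | 4.610
110 | F | H | H | H | H | H | H |  | H | Cl | H | H | 4.860 | 4.651
111 | C | Me | H | H | Me | H | H | O, O | H | Cl | H | H | 3.633 | 3.495
112 | E | Me | H | H | Me | H | H | O | H | Cl | H | H | 4.709 | 4.628
113 | A | H | H | H | H | H | Cl | O, O | H | Cl | H | H | 4.325 | 4.674
114 | D | H | H | H | H | H | H | O, O | H | H | H | Cl | 5.141 | 4.740
115 | F | H | H | H | H | H | H | O | H | H | H | Cl | 5.277 | 5.216
116 | C | Me | H | H | Me | H | H |  | H | H | H | Cl | 4.860 | 4.638
117 | B | Me | H | H | Me | H | Me | O, O | H | H | H | Cl | 3.824 | 4.007
118 | D | Me | H | H | Me | H | Cl | O, O | H | H | H | Cl | 5.283 | 5.216
119 | A | H | H | H | H | H | H |  | H | H | H | NO2 | 4.363 | 4.643

(a) A–F subsets.

(b) Substituted groups in pyridine N-oxide derivatives are shown in Fig. 2.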

Fig. 2

Main skeleton with different functional positions of pyridine N-oxide derivatives.

Adaptive neuro-fuzzy inference system

The neuro-fuzzy model in ANFIS is a multilayer neural network-based fuzzy system [18]. Its topology is shown in Fig. 1; the system has a total of five layers. In this connectionist structure, the input (layer 0) and output (layer 5) nodes represent the descriptors and the response, respectively, and the hidden layers contain nodes that function as membership functions (MFs) and rules. This removes a disadvantage of a normal feed-forward multilayer network, whose results are difficult for an observer to understand or interpret. ANFIS simulates a type-3 TSK (Takagi–Sugeno–Kang) fuzzy rule [19], in which the consequent part of the rule is a linear combination of the input variables and a constant. For a Sugeno fuzzy model, a common rule set with fuzzy if-then rules is as follows:
Fig. 1

A typical ANFIS structure.

Rule 1: IF $x$ is $A_1$ and $y$ is $B_1$ THEN $f_1 = p_1 x + q_1 y + r_1$

Rule 2: IF $x$ is $A_2$ and $y$ is $B_2$ THEN $f_2 = p_2 x + q_2 y + r_2$

For simplicity, we assume that the fuzzy inference system under consideration has two inputs, $x$ and $y$, and one output. The ANFIS contains five layers (Fig. 1):

Layer 1. The fuzzy part of ANFIS is mathematically incorporated in the form of membership functions (MFs). A membership function $\mu_{A_i}(x)$ can be any continuous and piecewise differentiable function that transforms the input value $x$ into a membership degree, that is, a value between 0 and 1. The most widely applied membership functions are the generalized bell (gbell MF) and the Gaussian function in Eqs. (6) and (7), which are described by the parameters $a_i$, $b_i$ and $c_i$. Layer 1 is therefore the fuzzification layer, in which each node represents a membership value:

$$\mu_{A_i}(x) = \frac{1}{1 + \left| \frac{x - c_i}{a_i} \right|^{2 b_i}} \qquad (6)$$

$$\mu_{A_i}(x) = \exp\left[ -\left( \frac{x - c_i}{a_i} \right)^2 \right] \qquad (7)$$

As the values of the parameters $\{a_i, b_i, c_i\}$ change, the bell-shaped functions vary accordingly, thus exhibiting various forms of membership function on the linguistic label $A_i$. Parameters in this layer are referred to as premise parameters.

Layer 2. Every node in this layer is a fixed node labeled Π, whose output is the product of all the incoming signals:

$$w_i = \mu_{A_i}(x) \, \mu_{B_i}(y), \quad i = 1, 2$$

The membership values $\mu_{A_i}(x)$ and $\mu_{B_i}(y)$ are multiplied to give the firing strength of a rule in which the variables $x$ and $y$ have the linguistic values $A_i$ and $B_i$, respectively.

Layer 3. This is the normalization layer, which normalizes the strength of all rules according to Eq. (9):

$$\bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2 \qquad (9)$$

where $w_i$ is the firing strength of the $i$th rule computed in layer 2. Node $i$ computes the ratio of the $i$th rule's firing strength to the sum of all rules' firing strengths. For convenience, the outputs of this layer are called normalized firing strengths.

Layer 4. Every node $i$ in this layer is an adaptive node with the node function:

$$O_i^4 = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i)$$

where $\bar{w}_i$ is a normalized firing strength from layer 3 and $\{p_i, q_i, r_i\}$ is the parameter set of this node. Parameters in this layer are referred to as consequent parameters.

Layer 5. The single node in this layer is a fixed node labeled Σ, which computes the overall output as the summation of all incoming signals:

$$O^5 = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i}$$

Thus we have constructed an ANFIS that is functionally equivalent to a Sugeno fuzzy model. This system is used in the present QSAR study because of its transparency and efficiency.
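The five layers can be traced in a short forward pass. The Python sketch below assumes a two-input, two-rule system with generalized bell membership functions. The function names and all parameter values are hypothetical, chosen only so the result is easy to check by hand; no training of the premise or consequent parameters is shown.

```python
import numpy as np

def gbell(x, a, b, c):
    """Generalized bell membership function with parameters a, b, c."""
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

def anfis_forward(x, y, premise_a, premise_b, consequent):
    """Forward pass of a two-input, two-rule first-order Sugeno system.

    premise_a, premise_b: lists of (a, b, c) tuples for labels A_i and B_i.
    consequent: list of (p, q, r) tuples, one per rule.
    """
    # Layer 1: membership degrees of x in A_i and of y in B_i.
    mu_a = np.array([gbell(x, *params) for params in premise_a])
    mu_b = np.array([gbell(y, *params) for params in premise_b])
    # Layer 2: firing strength of each rule (product of memberships).
    w = mu_a * mu_b
    # Layer 3: normalized firing strengths.
    w_bar = w / w.sum()
    # Layer 4: rule outputs f_i = p_i x + q_i y + r_i, weighted by w_bar.
    f = np.array([p * x + q * y + r for p, q, r in consequent])
    # Layer 5: overall output as the sum of the weighted rule outputs.
    return float(np.sum(w_bar * f))

# Hypothetical parameters: two symmetric labels per input.
premise_a = [(1.0, 2.0, -1.0), (1.0, 2.0, 1.0)]
premise_b = [(1.0, 2.0, -1.0), (1.0, 2.0, 1.0)]
consequent = [(1.0, 1.0, 0.0), (0.0, 0.0, 2.0)]

output = anfis_forward(0.0, 0.0, premise_a, premise_b, consequent)
```

At x = y = 0 both rules fire equally, so the output is the plain average of the two rule outputs (0 and 2).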

Data set collection and descriptor generation

A set of 119 variously functionalized pyridine N-oxides was collected along with their activity data [5]. The IC50 values were converted to pIC50 values and used as the dependent variable in the QSAR study. The main skeleton with the different functional positions of the pyridine N-oxide derivatives is shown in Fig. 2. A list of inhibitory activities is given in Table 1. Prior to the calculation of the molecular descriptors, the 3D structures of the studied compounds were optimized using the semi-empirical quantum-chemical PM3 method implemented in the Hyperchem program [20]. Over one hundred descriptors, encoding different aspects of the molecular structures, were calculated for each compound. These consisted of constitutional, topological, electronic, geometric and empirical descriptors. Pairs of descriptors that were highly correlated (R > 0.90) encode similar information, and therefore one descriptor of each such pair was eliminated. Descriptors with constant or almost constant values for all molecules were also eliminated. All molecular descriptors were generated using the Dragon3 software [21]. Table 2 shows the 15 different combinations of calibration and validation subsets used for variable selection via shuffling MARS. Fig. 3 shows the selected descriptors and the frequency with which each descriptor appeared in the shuffling MARS models. The shuffling MARS–ANFIS algorithm was written in our laboratory using MATLAB 7.0 [22] and run on a personal computer (Intel Pentium 4 processor, 1.8 GHz, 1 GB RAM).
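The descriptor pre-filtering step (dropping near-constant descriptors and one member of each highly correlated pair) might look like the following Python sketch. The text does not say which member of a correlated pair was removed or how "almost constant" was defined, so keeping the first descriptor encountered and the `min_std` threshold are assumptions, as are the function name and the toy matrix.

```python
import numpy as np

def filter_descriptors(X, names, r_max=0.90, min_std=1e-8):
    """Drop near-constant columns, then one column of each highly correlated pair.

    Assumption: of two correlated descriptors, the one seen first is kept.
    """
    X = np.asarray(X, dtype=float)
    # Step 1: remove descriptors whose values barely vary across molecules.
    varying = [j for j in range(X.shape[1]) if X[:, j].std() > min_std]
    # Step 2: keep a descriptor only if it is not highly correlated
    # with any descriptor that has already been kept.
    kept = []
    for j in varying:
        correlated = False
        for k in kept:
            r = np.corrcoef(X[:, j], X[:, k])[0, 1]
            if abs(r) > r_max:
                correlated = True
                break
        if not correlated:
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]

# Toy descriptor matrix (hypothetical): d2 duplicates d1, d3 is constant.
d1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
d2 = 2.0 * d1 + 1.0
d3 = np.full(5, 7.0)
d4 = np.array([2.0, -1.0, 3.0, 0.0, 1.0])
X_toy = np.column_stack([d1, d2, d3, d4])
X_kept, kept_names = filter_descriptors(X_toy, ["d1", "d2", "d3", "d4"])
```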
Table 2

Selecting the important variables using shuffling MARS method.

Run | Calibration set | R2Cal | RMSECal | Validation set | R2Val | RMSEVal
1 | A + B + C + D | 0.834 | 0.241 | E + F | 0.767 | 0.458
2 | A + B + C + E | 0.820 | 0.268 | D + F | 0.810 | 0.372
3 | A + B + D + E | 0.831 | 0.279 | C + F | 0.805 | 0.367
4 | A + C + D + E | 0.835 | 0.253 | B + F | 0.751 | 0.476
5 | B + C + D + E | 0.803 | 0.226 | A + F | 0.740 | 0.450
6 | A + B + C + F | 0.833 | 0.240 | D + E | 0.802 | 0.393
7 | A + B + D + F | 0.819 | 0.273 | C + E | 0.783 | 0.422
8 | A + C + D + F | 0.843 | 0.226 | B + E | 0.745 | 0.449
9 | B + C + D + F | 0.825 | 0.282 | A + E | 0.730 | 0.470
10 | A + B + E + F | 0.839 | 0.228 | C + D | 0.804 | 0.418
11 | A + C + E + F | 0.813 | 0.265 | B + D | 0.806 | 0.416
12 | B + C + E + F | 0.826 | 0.250 | A + D | 0.784 | 0.464
13 | A + D + E + F | 0.837 | 0.242 | B + C | 0.787 | 0.466
14 | B + D + E + F | 0.821 | 0.255 | A + C | 0.769 | 0.471
15 | C + D + E + F | 0.818 | 0.235 | A + B | 0.750 | 0.483
Mean |  | 0.827 | 0.251 |  | 0.776 | 0.438
Fig. 3

The selected descriptors and the frequency of each one in the shuffling MARS models.


Results and discussion

Shuffling MARS–ANFIS modeling

First, all 119 molecules studied in this work were sorted according to their biological activity. The molecules were then divided into six groups, five of which consisted of twenty molecules each and one of nineteen molecules. Each group was selected so that it covered the whole range of inhibitory activity, from weakly to highly active compounds. In the variable selection procedure, four groups were used as the calibration set and the two remaining groups as the validation set for evaluating the selected parameters. Because the data set was divided into six subgroups, 15 MARS models can be built with different calibration and validation sets. Since these calibration and validation sets contain different molecules, different descriptors are expected to be selected by the MARS search strategy in each model. In the calibration procedure, the forward selection and backward deletion algorithm uses the root mean square error of the validation set (RMSEv) as an index for evaluating the selected split points. The statistical parameters obtained for the 15 models are shown in Table 2. The selected descriptors and the frequency of each descriptor in the shuffling MARS models are shown in Fig. 3. Inspection of this figure shows that the solvation connectivity index (X3sol), length to breadth ratio (L/Bw), relative negative charge (RNCG), harmonic oscillator of aromatic index (HOMA), average molecular weight (AMW) and total path count (TPC) appeared more frequently (in more than 10 of the 15 runs) than the other descriptors. These six descriptors belong to the topological, electronic, geometric, constitutional and aromaticity descriptor classes. A detailed description of these descriptors is given in Ref. [23]. The six most important variables selected by the shuffling MARS algorithm were used as inputs for developing the ANFIS model to predict the pIC50 values of the SARS inhibitors.
ANFIS modeling involves two steps: (a) structure identification and (b) parameter identification. The former is concerned with finding a suitable number of rules and a proper partition of the feature space. The latter is concerned with the adjustment of system parameters, such as membership function (MF) parameters and linear coefficients. Increasing the number of MFs per input increases the number of rules accordingly [13]. In the first stage of ANFIS modeling, grid partitioning was used to partition the feature space. The number and type of membership functions were optimized using the RMSE of the test set as the criterion. For the ANFIS modeling, the data set was divided randomly into three groups: training, test and prediction sets. The training set consisted of 70 molecules and was used for model generation. The test set consisted of 30 molecules and was used to guard against overtraining. The prediction set consisted of 19 molecules and was used to evaluate the generated model. The values of pyridine N-oxide inhibitory activity predicted by the shuffling MARS–ANFIS model are listed in Table 1, which shows that the calculated pIC50 is a good estimate of the experimental pIC50. The correlation between the experimental and calculated values of pIC50 is shown in Fig. 4. The adjusted $R^2$ values for the training, test and prediction sets in the shuffling MARS–ANFIS model are 0.856, 0.862 and 0.870, respectively, and the corresponding RMSE values are 0.285, 0.337 and 0.382. The residuals of the calculated pIC50 values are plotted against the experimental values in Fig. 5. The distribution of the residuals on both sides of the zero line indicates that no systematic error exists in the QSAR model. This figure also shows that there are no outliers in the generated shuffling MARS–ANFIS model.
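The random 70/30/19 split and the two reported statistics can be sketched in Python as below. The text does not give the formula used for the adjusted R2, so the standard adjustment for the number of descriptors is assumed here; the function names and the random seed are also hypothetical.

```python
import numpy as np

def split_indices(n=119, n_train=70, n_test=30, seed=0):
    """Randomly split n molecule indices into training, test and prediction sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

def rmse(y, y_hat):
    """Root mean square error between experimental and calculated values."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def adjusted_r2(y, y_hat, n_descriptors):
    """R2 corrected for the number of descriptors (standard form, assumed here)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(1.0 - (1.0 - r2) * (n - 1) / (n - n_descriptors - 1))

train_idx, test_idx, pred_idx = split_indices()
```

Because the adjustment penalizes additional descriptors, it gives a fairer basis for comparing models built on different numbers of variables than the unadjusted R2.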
Fig. 4

Plot of the shuffling MARS–ANFIS calculated pIC50 values against the experimental ones for the training, test and validation sets.

Fig. 5

Plot of residuals versus experimental values of pIC50 for the shuffling MARS–ANFIS.


Validation of shuffling MARS–ANFIS model

The second step of this work was to investigate the validity of the generated model. The consistency and reliability of a method can be explored using cross-validation techniques [24]. Leave-one-out (LOO-CV) and leave-multiple-out (LMO-CV) cross-validation were used to assess the consistency of the model, and the Y-randomization test was performed to examine its robustness. In the LOO-CV algorithm, one compound is left out in each step as the prediction set and the model is developed using the remaining molecules as the training set [24]. The quality of cross-validation results is commonly judged in the literature by the $Q^2_{LOO}$ value, calculated using Eq. (11):

$$Q^2_{LOO} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (11)$$

where $y_i$ is the experimental value, $\hat{y}_i$ is the value predicted for compound $i$ when it is left out, and $\bar{y}$ is the mean experimental value. In this sense, a high value of this statistical parameter ($Q^2 > 0.5$) is considered proof of the high predictive ability of the model [25]. However, several authors suggest that a high value of $Q^2_{LOO}$ is necessary but not sufficient [26]. Consequently, we also used the LMO-CV and Y-randomization techniques. In the case of LMO, $M$ represents a group of randomly selected data points that are left out at the beginning and are then predicted by the model developed using the remaining data points. The $M$ molecules are therefore considered the prediction set. $R^2_{LMO}$ can be calculated using Eq. (12):

$$R^2_{LMO} = 1 - \frac{\sum_{i=1}^{M} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{M} (y_i - \bar{y})^2} \qquad (12)$$

In the present contribution, we performed leave-12-out (L12O) and leave-18-out (L18O) cross-validations. Groups of 12 and 18 compounds, respectively, were randomly selected from the training set. Each group was left out and predicted by the model developed from the remaining observations. This procedure was carried out 1000 times. Table 3 shows the results of the LOO and LMO cross-validations. The high values of $Q^2_{LOO}$ and $R^2_{LMO}$ indicate the consistency of the developed model.

To assess the robustness of shuffling MARS–ANFIS, the Y-randomization test was applied. The dependent variable vector (pIC50) was randomly shuffled and a new QSAR model was developed using the original variable matrix. The new QSAR model is expected to show low values of $R^2_p$ and $Q^2_{LOO}$. One hundred random shuffles of the y vector were performed, and the results are shown in Table 4. The poor mean values of $R^2_p$ and $Q^2_{LOO}$ indicate that the good results of the shuffling MARS–ANFIS model are not due to chance correlation or structural dependency of the training set.
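The three validation procedures can be sketched in Python as follows. To keep the example self-contained, an ordinary least-squares model stands in for ANFIS, so this illustrates the validation loops only and not the model used in this work. The synthetic data, the function names, the seeds and the reduced number of LMO repeats are all assumptions.

```python
import numpy as np

def fit_predict(X_train, y_train, X_new):
    """Stand-in model (ordinary least squares with intercept), not ANFIS."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def q2_loo(X, y):
    """Leave-one-out Q2: each compound is predicted by a model built without it."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        preds[i] = fit_predict(X[mask], y[mask], X[i:i + 1])[0]
    return float(1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2))

def r2_lmo(X, y, m, n_repeats=1000, seed=0):
    """Mean R2 over repeated random leave-m-out predictions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_repeats):
        left_out = rng.choice(n, size=m, replace=False)
        mask = np.ones(n, dtype=bool)
        mask[left_out] = False
        pred = fit_predict(X[mask], y[mask], X[left_out])
        y_out = y[left_out]
        scores.append(1.0 - np.sum((y_out - pred) ** 2) / np.sum((y_out - y_out.mean()) ** 2))
    return float(np.mean(scores))

def y_randomization(X, y, n_shuffles=100, seed=0):
    """Mean Q2_LOO after shuffling y against the unchanged descriptor matrix."""
    rng = np.random.default_rng(seed)
    return float(np.mean([q2_loo(X, rng.permutation(y)) for _ in range(n_shuffles)]))

# Synthetic data (hypothetical): 60 samples, 3 descriptors, small noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)

q2 = q2_loo(X_demo, y_demo)
lmo = r2_lmo(X_demo, y_demo, m=12, n_repeats=200)
q2_shuffled = y_randomization(X_demo, y_demo, n_shuffles=100)
```

A model that captures a real structure-activity relationship should keep high Q2 and R2 values under LOO and LMO, while the shuffled responses should give values near or below zero.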
Table 3

Statistics using LOO-CV and LMO-CV methods for comparing the results of shuffling MARS–ANFIS method with GA-PLS-ANFIS method.

Method | LOO Q2 | LOO RMSEp | L12O (a) R2 (b) | L12O (a) RMSEp | L18O (a) R2 (b) | L18O (a) RMSEp
Shuffling MARS–ANFIS | 0.892 | 0.331 | 0.884 | 0.359 | 0.870 | 0.380
GA-PLS-ANFIS (c) | 0.813 | 0.446 | 0.787 | 0.489 | 0.785 | 0.494

(a) Calculation of R2LMO was based on 1000 random selections of groups of 12 and 18 samples.

(b) All R2 values are adjusted coefficients of determination.

(c) Selected variables: X3sol, TPC, RNCG and AROM.

Table 4

Mean values of R2p and Q2LOO after performing 100 Y-randomization tests.

Method | Mean of R2p | Mean of Q2LOO
Shuffling MARS–ANFIS | 0.185 | 0.096
GA-PLS-ANFIS | 0.236 | 0.143

Comparison of shuffling MARS–ANFIS with GA-PLS-ANFIS

For further investigation, the GA-PLS technique was also used to select the most important descriptors. The theory behind this algorithm is discussed elsewhere [27]. To find the best model, GA-PLS was run many times with different initial population settings, and the models with the best fitness were selected. Fig. 6 shows the result of GA-PLS variable selection after 3000 runs. This figure shows that the most important descriptors are X3sol, TPC, RNCG and AROM (aromaticity). The descriptors selected by GA-PLS were used to develop an ANFIS model for predicting pIC50. The values of $Q^2_{LOO}$, $R^2_{LMO}$ and RMSEp for LOO, L12O and L18O in the GA-PLS-ANFIS model are summarized in Table 3, which also shows that the best GA-PLS model has four variables. The poor mean values of adjusted $R^2_p$ and $Q^2_{LOO}$ in Table 4 confirm that the good results of the GA-PLS-ANFIS model are not due to chance correlation and that the developed model is reliable.
Fig. 6

Selected variables using GA-PLS method after 3000 runs.

It is clear from Table 3 that the LOO, L12O and L18O results for the shuffling MARS–ANFIS model are superior to those of GA-PLS-ANFIS. Although the shuffling MARS–ANFIS model has six descriptors and the GA-PLS-ANFIS model has four, the adjusted $R^2$ is relatively independent of the number of variables, so the two models can still be compared. The RMSEp values for both LOO and LMO in Table 3 are about 23–27% lower for shuffling MARS–ANFIS than for GA-PLS-ANFIS.
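The relative difference in RMSEp between the two models can be checked directly from the values in Table 3. The short Python calculation below is only a worked example; the dictionary and variable names are illustrative.

```python
# RMSEp values from Table 3: (GA-PLS-ANFIS, shuffling MARS-ANFIS).
rmsep = {
    "LOO": (0.446, 0.331),
    "L12O": (0.489, 0.359),
    "L18O": (0.494, 0.380),
}

# Percentage reduction in RMSEp when moving from GA-PLS-ANFIS to shuffling MARS-ANFIS.
reduction = {
    method: 100.0 * (ga_pls - mars) / ga_pls
    for method, (ga_pls, mars) in rmsep.items()
}

for method, value in reduction.items():
    print(f"{method}: {value:.1f}% lower RMSEp")
```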

Descriptors appearing in the QSAR model

The most frequently selected variable in the shuffling MARS–ANFIS model is X3sol, which is one of the solvation connectivity indices. These molecular descriptors are defined to model solvation entropy and describe dispersion interactions in solution. The next important variable selected by the shuffling MARS–ANFIS model is the relative negative charge (RNCG). This descriptor is the partial charge of the most negative atom divided by the total negative charge and is defined by the following equation:

$$\mathrm{RNCG} = \frac{Q_{max}}{Q_{total}}$$

where $Q_{max}$ is the partial charge of the most negative atom and $Q_{total}$ is the total negative charge. Different heteroatoms such as nitrogen, oxygen and halogens affect $Q_{total}$ and $Q_{max}$ dramatically. The presence of donor–acceptor atoms for hydrogen bonding also influences the values of both $Q_{total}$ and $Q_{max}$. Therefore, the presence of these functional groups is important in the inhibitor–isozyme interaction. Another important factor in the inhibition mechanism has been shown to be the length to breadth ratio (L/Bw) of the inhibitor [28]. The length to breadth ratio is defined as the ratio of the longest (L) to the shortest (B) side of the rectangle that envelops the molecular structure and at the same time maximizes the L/B ratio. This shape parameter accounts not only for the distance between the extreme atoms along the principal axis but also for the distribution of all atoms around the molecular center. The parameter TPC is the total path count of the H-depleted molecular structure and is a useful quantitative measure of molecular complexity. The TPC value for molecules with simple structures is smaller than that calculated for molecules with extensive branching [23]. The parameter HOMA is the harmonic oscillator model of aromaticity index and is one of the resonance indices. The resonance indices are theoretical quantities used to explain the stability of benzene and to predict the degree of delocalization in conjugated systems [23]. The last parameter that was used for modeling and has an acceptable frequency of selection in the shuffling MARS approach is the average molecular weight (AMW). This parameter is calculated by dividing the molecular weight by the number of atoms in the molecule. It is a simple molecular descriptor that encodes information on the elemental composition of the molecule.
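Two of these descriptors are simple enough to compute by hand. The Python sketch below follows the verbal definitions given above for RNCG and AMW; the function names and the partial charges are hypothetical, and water is used only as a convenient example for the atomic weights. It is not the Dragon implementation.

```python
def rncg(partial_charges):
    """Relative negative charge: most negative partial charge / total negative charge."""
    negatives = [q for q in partial_charges if q < 0]
    if not negatives:
        raise ValueError("no negatively charged atoms")
    return min(negatives) / sum(negatives)

def average_molecular_weight(atomic_weights):
    """Average molecular weight: molecular weight divided by the number of atoms."""
    return sum(atomic_weights) / len(atomic_weights)

# Hypothetical partial charges for a four-atom fragment.
charges = [-0.4, 0.3, -0.1, 0.2]
rncg_value = rncg(charges)  # -0.4 / (-0.4 - 0.1)

# Water as an example: two hydrogens (1.008) and one oxygen (15.999).
amw_water = average_molecular_weight([1.008, 1.008, 15.999])
```

RNCG approaches 1 when a single atom carries almost all of the negative charge and becomes smaller as the negative charge is spread over more atoms.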

Conclusions

A cumbersome step in every QSAR study is selecting suitable descriptors using a feature selection method. This is more serious when the data set under study is diverse or the mechanism of the process is complex. The data set considered in this work consisted of drug-like molecules inhibiting SARS and, consequently, the mechanism of their action could be complicated. The shuffling MARS–ANFIS approach was successfully applied to predict the inhibitory activity of pyridine N-oxide derivatives against SARS. The reasons behind this success could be: (1) the strength of shuffling MARS as a feature selection technique. It is shown that the six parameters AMW, X3sol, L/Bw, RNCG, HOMA and TPC chosen by shuffling MARS significantly affect the inhibition process of the drug-like molecules. (2) The role of ANFIS as a mapping model with the power to predict the inhibition behavior. ANFIS is a general framework that combines two technologies, namely neural networks and fuzzy systems; by using fuzzy techniques, both numerical and linguistic knowledge can be combined into fuzzy rules, which require extensive trial and error for the optimization of their architecture. The shuffling MARS–ANFIS approach has been shown to be an effective method for variable selection and model development by using the leave-one-out and leave-multiple-out cross-validation techniques and also Y-randomization. Comparing the results of GA-PLS-ANFIS with those of shuffling MARS–ANFIS reveals that the latter model selects the best variables to predict the inhibitory action of pyridine N-oxide derivatives. The appearance of the above-mentioned parameters in the model indicates that the type of atoms, the size of the molecule, the complexity of the compound, aromaticity and the elemental composition of the molecule play roles in the mechanism of inhibition.
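The two validation ideas named above, leave-one-out cross-validation and Y-randomization, can be illustrated with a minimal Python sketch. To keep it self-contained, a one-descriptor least-squares line stands in for the model; this is an assumption of the sketch and is not the MARS–ANFIS model of this work. The function names and the toy data are likewise hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - my) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def loo_q2(x: Sequence[float], y: Sequence[float]) -> float:
    """Leave-one-out Q2: each compound is predicted by a model fitted without it."""
    x, y = list(x), list(y)
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return r_squared(y, preds)


def y_randomization(x: Sequence[float], y: Sequence[float],
                    n_rounds: int = 100, seed: int = 0) -> List[float]:
    """Refit the model on randomly permuted responses and collect the R2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        y_shuffled = list(y)
        rng.shuffle(y_shuffled)
        slope, intercept = fit_line(x, y_shuffled)
        fitted = [slope * xi + intercept for xi in x]
        scores.append(r_squared(y_shuffled, fitted))
    return scores


if __name__ == "__main__":
    # Toy descriptor/activity pairs (illustrative values, not data from this study).
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    y = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]
    print("LOO Q2:", loo_q2(x, y))
    print("mean R2 after Y-randomization:", mean(y_randomization(x, y)))
```

A model that captures a real structure–activity relationship is expected to keep a high Q2 under leave-one-out, while the R2 values obtained after shuffling the responses should on average fall well below that of the original fit; a shuffled R2 close to the original one would point to chance correlation.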