Noel M O'Boyle, David S Palmer, Florian Nigsch, John Bo Mitchell.
Abstract
BACKGROUND: We present a novel feature selection algorithm, Winnowing Artificial Ant Colony (WAAC), that performs simultaneous feature selection and model parameter optimisation for the development of predictive quantitative structure-property relationship (QSPR) models. The WAAC algorithm is an extension of the modified ant colony algorithm of Shen et al. (J Chem Inf Model 2005, 45: 1024-1029). We test the ability of the algorithm to develop a predictive partial least squares model for the Karthikeyan dataset (J Chem Inf Model 2005, 45: 581-590) of melting point values. We also test its ability to perform feature selection on a support vector machine model for the same dataset.
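To make the idea of pheromone-guided feature selection concrete, the following is a minimal sketch of a generic ant-colony-style subset search. It is not the WAAC algorithm: the winnowing step and the model parameter optimisation described above are omitted, and the function name `aco_feature_selection`, its parameters, the update rule and the toy objective are all assumptions made for illustration. In practice the objective would be something like a cross-validated error of a PLS or SVM model built on the chosen descriptors.

```python
import random


def aco_feature_selection(n_features, objective, n_ants=20, n_iter=30,
                          evaporation=0.2, seed=0):
    """Generic ant-colony-style subset search (illustrative, not WAAC).

    `objective` takes a list of feature indices and returns a value to
    be minimised.
    """
    rng = random.Random(seed)
    # pheromone[i] = [weight for excluding feature i, weight for including it]
    pheromone = [[1.0, 1.0] for _ in range(n_features)]
    best_subset, best_score = None, float("inf")

    for _ in range(n_iter):
        for _ in range(n_ants):
            # Each ant includes feature i with probability on / (on + off).
            subset = [i for i, (off, on) in enumerate(pheromone)
                      if rng.random() < on / (on + off)]
            score = objective(subset)
            if score < best_score:
                best_subset, best_score = subset, score

        # Evaporate all trails, then reinforce the best-so-far choices.
        chosen = set(best_subset)
        for i in range(n_features):
            pheromone[i][0] *= 1.0 - evaporation
            pheromone[i][1] *= 1.0 - evaporation
            pheromone[i][1 if i in chosen else 0] += 1.0

    return best_subset, best_score


# Toy objective (hypothetical): count disagreements with a hidden
# "informative" feature set.
informative = {0, 3, 7}


def toy_objective(subset):
    return len(informative.symmetric_difference(subset))


subset, score = aco_feature_selection(10, toy_objective, seed=1)
```

Because the random generator is seeded, repeated runs with the same seed follow the same trajectory, which mirrors the way repetitions in Figure 2 are compared from a common initial random seed.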
Year: 2008 PMID: 18959785 PMCID: PMC2603525 DOI: 10.1186/1752-153X-2-21
Source DB: PubMed Journal: Chem Cent J ISSN: 1752-153X Impact factor: 4.215
Figure 1. Outline of the WAAC algorithm.
Figure 2. Value of the objective function for the best model at each iteration of the WAAC algorithm for the PLS model (top) and the SVM model (bottom). The panels on the right, (b) and (d), show the effect of having a single optimisation phase without any winnowing. Ten repetitions of the algorithm are shown, with corresponding repetitions starting from the same initial random seed.
Description of the best models found by the WAAC algorithm
| | WAAC/PLS | WAAC/SVM |
| Number of descriptors | 68 | 28 |
| 2D descriptors | petitjean, weinerPath, weinerPol, a_ICM, b_1rotR, chi0_C, chi1, reactive, a_heavy, a_nH, a_nF, a_nO, a_nS, VadjEq, VadjMa, balabanJ, PEOE_RPC+, PEOE_VSA+3, PEOE_VSA+4, PEOE_VSA+5, PEOE_VSA+6, PEOE_VSA-1, PEOE_VSA-4, PEOE_VSA_FPNEG, PEOE_VSA_PPOS, PC+, PC-, Q_PC+, Q_RPC+, Q_VSA_FHYD, Q_VSA_FNEG, Q_VSA_FPNEG, Q_VSA_FPOL, Q_VSA_FPOS, Q_VSA_FPPOS, Q_VSA_PNEG, Q_VSA_PPOS, Kier1, Kier3, KierA1, KierA2, apol, vsa_acc, SlogP_VSA3, SlogP_VSA5, SMR_VSA3, SMR_VSA5, TPSA | radius, weinerPol, b_1rotR, b_rotR, chi1v_c, a_nO, a_nP, balabanJ, PEOE_VSA+2, PEOE_VSA+3, PEOE_VSA-1, PEOE_VSA-5, PEOE_VSA-6, Q_RPC+, SlogP_VSA1, SlogP_VSA4, SlogP_VSA9, SMR_VSA2, SMR_VSA4, SMR_VSA6, TPSA |
| 3D descriptors | AM1_dipole, AM1_Eele, E_sol, E_strain, E_tor, MNDO_HF, MNDO_dipole, MNDO_E, dipole, PM3_HF, ASA-, ASA_H, CASA-, FASA_H, FASA_P, VSA, glob, std_dim1, std_dim3, vol | E_oop, E_strain, E_vdw, PM3_LUMO, FASA_P, FCASA+, rgyr |
| Parameters | components = 49 | Cost = 5, ε = 0.21 |
Figure 3. Performance of models developed with WAAC: (a) a PLS model and (b) an SVM model. The first two columns contain predictions for the training set and test set, respectively. The line x = y is shown for comparison. The column on the right shows the residuals from the test set prediction along with a line of best fit (light line); for comparison, the line x = 0 is shown (heavy line). Outliers are shown as filled circles in the test set prediction and residuals plots. All values in °C.
Summary statistics for the models discussed in the text
| | WAAC/PLS | WAAC/SVM | SVM | | Random Forest |
| RMSE (°C) | 44.4 | 30.7 | 36.2 | 47.6 | 17.8 (44.7)* |
| R² | 0.52 | 0.77 | 0.68 | 0.44 | 0.92 (0.51)* |
| bias (°C) | 0.0 | -1.6 | -2.3 | -3.4 | 0.0 |
| RMSE (°C) | 46.6 | 45.1 | 43.9 | 48.3 | 44.5 |
| R² | 0.51 | 0.54 | 0.56 | 0.47 | 0.55 |
| bias (°C) | -0.7 | -2.1 | -2.3 | -4.1 | -0.4 |
| mean (°C) | 166.5 | 165.2 | 165.0 | 163.2 | 167.0 |
| standard deviation (°C) | 47.1 | 51.6 | 49.3 | 49.5 | 41.0 |
| Slope | -0.49 | -0.43 | -0.44 | -0.49 | -0.53 |
* Out-of-bag estimates for RMSE and R² are shown in parentheses.
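As a reminder of what the RMSE, R² and bias rows measure, here is a minimal sketch that computes them from observed and predicted values. The definitions are assumptions, since the table does not state them: bias is taken as the mean of (predicted − observed), and R² as 1 − SS_res/SS_tot rather than the squared correlation coefficient. The function name `summary_stats` and the numbers are illustrative and are not taken from the dataset.

```python
import math


def summary_stats(observed, predicted):
    """Return RMSE, R^2 and bias for paired observed/predicted values."""
    n = len(observed)
    # Assumed sign convention: residual = predicted - observed.
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,  # assumed definition of R^2
        "bias": sum(residuals) / n,
    }


# Illustrative melting points in °C (made-up values).
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
stats = summary_stats(observed, predicted)
```

With these toy values the residuals alternate between +10 and −10 °C, so the RMSE is 10 °C while the bias is zero: a model can be unbiased on average and still have a large error.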
Figure 4. Structures of outliers for the models discussed in the text. An outlier is defined as any molecule with a residual greater than four standard deviations from the mean. Molecules 41, 4161 and 4195 are outliers for the WAAC/SVM model; molecules 4161 and 4208 are outliers for both the RF and kNN models; molecule 4161 is the single outlier for the WAAC/PLS model.
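The outlier rule in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions: the standard deviation is the population standard deviation of the residuals themselves, the test is two-sided, and the helper name `find_outliers` and the residual values are invented for the example.

```python
import statistics


def find_outliers(residuals, n_sd=4.0):
    """Return indices of residuals more than n_sd standard deviations
    from the mean residual."""
    mean = statistics.fmean(residuals)
    # Assumption: population standard deviation of the residuals.
    sd = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - mean) > n_sd * sd]


# Forty small residuals plus one large one (made-up values, in °C).
residuals = [1.0, -1.0] * 20 + [50.0]
outliers = find_outliers(residuals)
```

Here only the last molecule (index 40) is flagged. Note that an extreme residual inflates the standard deviation it is compared against, so a four-standard-deviation cutoff can only flag anything when the sample is reasonably large.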
Figure 5. The effect of the number of components on the predictive ability of a PLS model. The red dashed line is a model based on all of the features, whereas the model represented by the blue solid line is based only on the subset selected by the WAAC algorithm. The best subset line ends at 59 components, as there are only 59 features in this subset. The line for all features is truncated at 174 components as the RMSE rapidly increases after this point.
Figure 6. Performance of (a) a . The first two columns contain predictions for the training set and test set, respectively. The line x = y is shown for comparison. The column on the right shows the residuals from the test set prediction along with a line of best fit (light line); for comparison, the line x = 0 is shown (heavy line). Outliers are shown as filled circles in the test set prediction and residuals columns. All values in °C.
Figure 7. Value of the objective function for the best PLS model at each iteration of (a) a genetic algorithm and (b) the WAAC algorithm. Ten repetitions of each algorithm are shown. The number of PLS components was set to 49.
Figure 8. Relationship between the population size and the minimum value of the objective function for the WAAC/PLS model. The value of the objective function is the minimum found from ten repetitions of the algorithm.