Qiang Su, Wencong Lu, Dongshu Du, Fuxue Chen, Bing Niu, Kuo-Chen Chou.
Abstract
Toxicity evaluation is an essential step in drug development. It usually begins with animal experiments, which are time-consuming and costly. To speed up this process, a quantitative structure-activity relationship (QSAR) study was performed to develop a computational model correlating the structures of 581 aromatic compounds with their aquatic toxicity to Tetrahymena pyriformis. A set of 68 molecular descriptors derived solely from the structures of the aromatic compounds was calculated using Gaussian 03, HyperChem 7.5, and TSAR V3.3. A combined feature selection method, minimum Redundancy Maximum Relevance (mRMR)-genetic algorithm (GA)-support vector regression (SVR), was applied to select the best descriptor subset for the QSAR analysis. SVR was used to model toxic potency from a training set of 500 compounds, and five-fold cross-validation was used to optimize the parameters of the SVR model. The resulting SVR model was tested on an independent dataset of 81 compounds. Both high internal consistency and high external predictive ability were obtained, indicating that the SVR model is a promising tool for rapid toxicity detection.
Keywords: QSAR; aromatic compounds; genetic algorithm; mRMR; tetrahymena pyriformis
Year: 2017 PMID: 28467816 PMCID: PMC5564774 DOI: 10.18632/oncotarget.17210
Source DB: PubMed Journal: Oncotarget ISSN: 1949-2553
RMSE obtained by mRMR-GA-SVR method
| RMSE | Kernel function | Descriptors |
|---|---|---|
| 0.41 | Linear kernel | ΔE, logP, 2χ, 3χc, 4χpc, 3χv, 1κa, Φ, B, NHal |
| 0.38 | Polynomial kernel | LUMO, ΔE, MW, logP, NHal, NHdon |
| 0.38 | Gaussian (RBF) kernel | LUMO, ΔE, MW, logP, 1χv, 3χc, 4χpc, 4χpcv, 1κa, NHdon |
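The mRMR step named in the table title can be illustrated with a small greedy ranking routine. This is only a sketch under stated assumptions: it swaps in the absolute Pearson correlation as a stand-in for mutual information, uses the "relevance minus mean redundancy" score, and the names `abs_corr`, `mrmr_rank`, `x1`, `x2`, `x3` are hypothetical, not taken from the paper.

```python
import math


def abs_corr(a, b):
    """Absolute Pearson correlation; returns 0.0 for a constant column."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_a * var_b)


def mrmr_rank(columns, target, n_select):
    """Greedy mRMR-style ranking.

    columns: dict mapping descriptor name -> list of values.
    Assumption: |Pearson r| replaces mutual information for simplicity.
    """
    relevance = {name: abs_corr(col, target) for name, col in columns.items()}
    selected, remaining = [], list(columns)
    while remaining and len(selected) < n_select:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(abs_corr(columns[name], columns[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): x2 duplicates x1, x3 carries independent signal.
toy_columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 12],
    "x3": [1, -1, 1, -1, 1, -1],
}
toy_target = [a + b for a, b in zip(toy_columns["x1"], toy_columns["x3"])]
toy_selected = mrmr_rank(toy_columns, toy_target, 2)
```

On the toy data the redundant copy is penalised, so the second pick is `x3` rather than the duplicate of the first pick.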
Figure 1. RMSE vs. ε in 5-fold CV using the polynomial kernel function (C = 2.3)
Figure 2. RMSE vs. C in 5-fold CV using the polynomial kernel function (ε = 0.11)
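The parameter scans in these two figures amount to computing a 5-fold cross-validated RMSE for each candidate value of one parameter while the other is held fixed. A minimal sketch of that loop follows, assuming a generic `fit_predict` callable; `mean_model` is a placeholder, not SVR, and all function names here are hypothetical.

```python
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_rmse(X, y, fit_predict, k=5, seed=0):
    """Pooled RMSE over the k held-out folds.

    fit_predict(X_train, y_train, X_test) must return predictions for X_test.
    """
    squared_errors = []
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train],
                            [y[i] for i in train],
                            [X[i] for i in fold])
        squared_errors += [(p - y[i]) ** 2 for p, i in zip(preds, fold)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def sweep(X, y, make_fit_predict, values, k=5):
    """CV RMSE for each candidate value of one hyperparameter."""
    return {v: cv_rmse(X, y, make_fit_predict(v), k) for v in values}


def mean_model(X_train, y_train, X_test):
    """Placeholder model: predicts the training mean for every test sample."""
    mean = sum(y_train) / len(y_train)
    return [mean] * len(X_test)
```

To reproduce a curve like Figure 1, `make_fit_predict` would return a function that trains an SVR with the given ε at fixed C; the placeholder above only exercises the cross-validation plumbing.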
RMSE, R, and Q for log(IGC₅₀⁻¹) obtained for the training set and the external test set using different models
| Method | Training set n | Training set RMSE | Training set R | Test set n | Test set RMSE | Test set Q |
|---|---|---|---|---|---|---|
| SVR | 500 | 0.38 | 0.84 | 81 | 0.44 | 0.77 |
| PLS | 500 | 0.42 | 0.78 | 81 | 0.50 | 0.68 |
| ANN | 500 | 0.40 | 0.82 | 81 | 0.46 | 0.76 |
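For readers who want to recompute statistics of this kind, the sketch below gives one common set of definitions. The record does not state the exact formulas, so the definition of the external Q² used here (prediction error relative to deviation from the training-set mean) is an assumption, and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def q2_external(y_test, y_pred, y_train_mean):
    """Assumed definition: 1 - PRESS / sum of squares about the training mean."""
    press = sum((t - p) ** 2 for t, p in zip(y_test, y_pred))
    total = sum((t - y_train_mean) ** 2 for t in y_test)
    return 1.0 - press / total
```

A perfect prediction gives an RMSE of 0 and an external Q² of 1; larger errors push RMSE up and Q² down.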
Figure 3. Plot of the experimental vs. predicted log(IGC₅₀⁻¹) values by the SVR model
RMSE and Q² for log(IGC₅₀⁻¹) of the training set and external test set of aromatic compounds using different descriptor subsets
| Descriptors | Training set RMSE | Training set Q² | Test set RMSE | Test set Q² |
|---|---|---|---|---|
| LUMO, ΔE, MW, logP, NHal, NHdon | 0.38 | 0.84 | 0.44 | 0.77 |
| ΔE, MW, logP, NHal, NHdon | 0.43 | 0.82 | 0.46 | 0.73 |
| LUMO, MW, logP, NHal, NHdon | 0.43 | 0.82 | 0.46 | 0.73 |
| LUMO, ΔE, logP, NHal, NHdon | 0.53 | 0.69 | 0.66 | 0.53 |
| LUMO, ΔE, MW, NHal, NHdon | 0.55 | 0.69 | 0.64 | 0.56 |
| LUMO, ΔE, MW, logP, NHdon | 0.44 | 0.82 | 0.47 | 0.74 |
| LUMO, ΔE, MW, logP, NHal | 0.45 | 0.82 | 0.46 | 0.73 |
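The table above follows a leave-one-descriptor-out pattern: each row after the first drops exactly one of the six selected descriptors. The sketch below enumerates those subsets and ranks the dropped descriptors by the training-set RMSE reported in the table; the helper names are hypothetical, and the RMSE values are copied from the table rather than recomputed.

```python
descriptors = ["LUMO", "ΔE", "MW", "logP", "NHal", "NHdon"]


def leave_one_out_subsets(names):
    """Return (dropped_name, remaining_names) for each descriptor in turn."""
    return [(dropped, [d for d in names if d != dropped]) for dropped in names]


# Training-set RMSE after dropping each descriptor (values from the table).
baseline_rmse = 0.38
rmse_without = {
    "LUMO": 0.43,
    "ΔE": 0.43,
    "MW": 0.53,
    "logP": 0.55,
    "NHal": 0.44,
    "NHdon": 0.45,
}

# Rank descriptors by how much the error grows when they are removed.
ranked = sorted(rmse_without.items(), key=lambda item: item[1], reverse=True)
subsets = leave_one_out_subsets(descriptors)
```

With the tabulated values, removing logP or MW raises the training RMSE the most (0.55 and 0.53 against a baseline of 0.38), while removing any of the other four changes it by at most 0.07.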
Figure 4. log(IGC₅₀⁻¹) vs. LUMO by SA
Figure 5. log(IGC₅₀⁻¹) vs. ΔE by SA
Figure 7. log(IGC₅₀⁻¹) vs. logP by SA
Figure 9. log(IGC₅₀⁻¹) vs. NHdon by SA
Molecular descriptors and the obtaining methods
| Software | Descriptors |
|---|---|
| Gaussian 03 | HOMO energy, LUMO energy, the HOMO-LUMO gap (ΔE), the total molecular energy (ETot), the minimum (QNmax) and the maximum (QPmax) atomic partial charge, dipole moment (μ), polarizability (α) |
| HyperChem release 7.5 | Heat of formation (HF), molecular surface area (MSA), molecular volume (MVol), logarithm of the octanol-water partition coefficient (logP), hydration energy (HE), molecular refractivity (MR) |
| TSAR V3.3 | Molecular weight (MW); Kier and Hall simple and valence-corrected molecular connectivity indices (χ); Kappa shape indices (κ); shape flexibility (Φ); Wiener, Randić and Balaban topological indices; E-state indices (S); the number of H-bond donors (NHdon) and acceptors (NHacc); atom counts (oxygen, nitrogen, fluorine, chlorine, bromine, iodine, halogen atoms, heteroatoms); group counts (hydroxyl, amino, aldehyde, nitro, cyano, acid anhydride, methyl) |
Parameters of the GA-SVR feature selection
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Population Size | 50 | Regression method | SVR |
| Maximum generations | 100 | Cross-validation | 5-fold |
| Probability of crossover | 0.75 | Fitness function | |
| Probability of mutation | 0.01 | Regularization parameter ( | 10 |
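The GA settings in the table above (population size 50, 100 generations, crossover probability 0.75, mutation probability 0.01) can be plugged into a simple binary-mask genetic algorithm. The sketch below is an illustration only: tournament selection, single-point crossover, per-bit mutation, and elitism are assumptions, the fitness function is supplied by the caller because it is not given in this record, and `toy_fitness` is a hypothetical stand-in for a cross-validated model error.

```python
import random


def ga_select(n_features, fitness, pop_size=50, generations=100,
              p_crossover=0.75, p_mutation=0.01, seed=0):
    """Minimise fitness(mask) over boolean feature masks (n_features >= 2)."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]

        def tournament():
            # Assumed selection scheme: binary tournament on fitness.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] <= scores[b] else population[b]

        children = [best[:]]  # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            p1, p2 = tournament()[:], tournament()[:]
            if rng.random() < p_crossover:
                cut = rng.randrange(1, n_features)  # single-point crossover
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                for i in range(n_features):
                    if rng.random() < p_mutation:
                        child[i] = not child[i]  # bit-flip mutation
                children.append(child)
        population = children[:pop_size]
        best = min(population, key=fitness)
    return best


# Toy fitness (hypothetical): Hamming distance to a hidden target mask.
target = [True, False, True, True, False, False, True, False, True, False]


def toy_fitness(mask):
    return sum(m != t for m, t in zip(mask, target))


best_mask = ga_select(len(target), toy_fitness)
```

In a descriptor-selection setting, `fitness` would instead train a regression model on the descriptors flagged `True` and return its cross-validated RMSE, so that lower values mean better subsets.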