Ming Hao, Yan Li, Yonghua Wang, Shuwei Zhang.
Abstract
Experimental pEC50 values for 216 selective respiratory syncytial virus (RSV) inhibitors are used to develop classification models as a potential screening tool for a large library of target compounds. A variable selection algorithm coupled with random forests (VS-RF) is used to extract the physicochemical features most relevant to RSV inhibition. Based on the small set of selected descriptors, four other widely used approaches, i.e., support vector machine (SVM), Gaussian process (GP), linear discriminant analysis (LDA) and k nearest neighbors (kNN), are also employed and compared with the VS-RF method in terms of several rigorous evaluation criteria. The results indicate that the VS-RF model is a powerful tool for classifying RSV inhibitors, producing the highest overall accuracy of 94.34% for the external prediction set, which significantly outperforms the other four methods, whose average accuracy is 80.66%. The proposed model, with excellent prediction capacity in both internal and external validation, should be important for screening and optimizing potential RSV inhibitors prior to chemical synthesis in drug development.
Keywords: Mold2 descriptors; RSV; random forest; variable selection
Year: 2011 PMID: 21541057 PMCID: PMC3083704 DOI: 10.3390/ijms12021259
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
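The workflow described in the abstract (select a small descriptor subset with a random forest, then classify) can be sketched as follows. This is a minimal illustration in Python with scikit-learn, not the authors' implementation: the abstract does not state which selection algorithm or software was used, so the backward-elimination helper `select_features_rf`, the synthetic data from `make_classification`, and the tree counts are all assumptions. Only the dataset size (216 compounds, 53 of them held out), the six selected descriptors, and `mtry = 4` (mapped here to `max_features=4`) come from the tables on this page.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def select_features_rf(X, y, n_keep=6, drop_frac=0.2, seed=0):
    """Backward elimination (assumed scheme): refit a forest and
    repeatedly drop the least important features until n_keep remain."""
    kept = np.arange(X.shape[1])
    while len(kept) > n_keep:
        rf = RandomForestClassifier(n_estimators=200, random_state=seed)
        rf.fit(X[:, kept], y)
        order = np.argsort(rf.feature_importances_)  # ascending importance
        n_drop = max(1, int(drop_frac * len(kept)))
        n_drop = min(n_drop, len(kept) - n_keep)  # never go below n_keep
        kept = np.sort(kept[order[n_drop:]])
    return kept


# Synthetic stand-in for a descriptor matrix (NOT the RSV data).
X, y = make_classification(n_samples=216, n_features=60, n_informative=6,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=53, stratify=y, random_state=0)

# Select descriptors on the training set only, then fit the final forest.
selected = select_features_rf(X_train, y_train, n_keep=6)
model = RandomForestClassifier(n_estimators=500, max_features=4, random_state=0)
model.fit(X_train[:, selected], y_train)
accuracy = model.score(X_test[:, selected], y_test)
print(len(selected), round(accuracy, 3))
```

Selecting features on the training split alone keeps the held-out compounds out of the selection step, so the reported accuracy stays an external estimate.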
Figure 1. Self-organizing map (SOM) top map indicating the distribution of the training and external prediction sets. The training set is labeled in black font and the prediction set in red font. Each number corresponds to the series number of a compound among the RSV inhibitors.
The six Mold2 descriptors selected by the variable selection algorithm coupled with random forests (VS-RF), and their definitions.
| Descriptor | Definition | Class |
|---|---|---|
| D299 | The largest eigenvalue | Eigenvalue-based indices |
| D347 | Molecular topological path index of order 07 | Walk and path counts |
| D490 | Moran topological structure autocorrelation length-4 weighted by atomic van der Waals volumes | 2D autocorrelation |
| D503 | Moran topological structure autocorrelation length-1 weighted by atomic polarizabilities | 2D autocorrelation |
| D513 | Molecular topological order-3 charge index | Topological charge indices |
| D528 | Mean molecular topological order-8 charge index | Topological charge indices |
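Two of the descriptors above (D490 and D503) are Moran autocorrelations at a fixed topological lag. As a rough illustration of what such a descriptor measures, the sketch below uses the common textbook form of the Moran index: the mean product of property deviations over atom pairs separated by exactly `lag` bonds, divided by the property variance. It is an assumption that Mold2 uses this exact form; the function names, the 4-atom chain, and the weights are hypothetical and are not real van der Waals volumes or polarizabilities.

```python
from collections import deque


def topological_distances(adjacency):
    """All-pairs shortest path lengths (in bonds) via breadth-first search."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


def moran_autocorrelation(adjacency, weights, lag):
    """Moran index at a given topological lag for per-atom weights."""
    n = len(weights)
    mean = sum(weights) / n
    dev = [w - mean for w in weights]
    variance = sum(d * d for d in dev) / n
    dist = topological_distances(adjacency)
    # Unordered atom pairs whose topological distance equals the lag.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if dist[i][j] == lag]
    if not pairs or variance == 0:
        return 0.0
    covariance = sum(dev[i] * dev[j] for i, j in pairs) / len(pairs)
    return covariance / variance


# Hypothetical 4-atom chain 0-1-2-3 with made-up atomic weights.
chain = [[1], [0, 2], [1, 3], [2]]
weights = [1.0, 2.0, 3.0, 4.0]
lag1 = moran_autocorrelation(chain, weights, lag=1)  # about 0.333
lag3 = moran_autocorrelation(chain, weights, lag=3)  # about -1.8
```

A positive value at lag 1 means neighbouring atoms tend to have similar property values; the negative value at lag 3 reflects that the two chain ends sit on opposite sides of the mean.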
The prediction performance for high- and low-activity compounds as respiratory syncytial virus (RSV) inhibitors from the VS-RF, SVM, GP, LDA and kNN statistical methods for the external prediction set and the 10-fold cross-validation (a).
| Method | TP | FN | SE (%) | TN | FP | SP (%) | Q (%) | MCC | F | Qcv (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| VS-RF | 27 | 0 | 100 | 23 | 3 | 88.46 | 94.34 | 0.89 | 0.96 | 81.6 |
| SVM | 23 | 4 | 85.19 | 21 | 5 | 80.77 | 83.02 | 0.66 | 0.84 | 79.1 |
| GP | 27 | 0 | 100 | 20 | 6 | 76.92 | 88.68 | 0.79 | 0.9 | 78 |
| LDA | 20 | 7 | 74.07 | 21 | 5 | 80.77 | 77.36 | 0.55 | 0.77 | 67.5 |
| kNN | 22 | 5 | 81.48 | 17 | 9 | 65.38 | 73.58 | 0.48 | 0.76 | 72.9 |
(a) VS-RF, mtry = 4; SVM, C = 10, sigma = 0.284; GP, sigma = 0.284; kNN, k = 17; TP, true positives; FN, false negatives; SE, sensitivity; TN, true negatives; FP, false positives; SP, specificity; Q, the overall prediction accuracy; MCC, Matthews correlation coefficient; F, F-measure; Qcv, the prediction accuracy from 10-fold cross-validation for the training set.
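The SE, SP, Q and MCC values in the table above follow directly from the four confusion-matrix counts. A short sketch, assuming the standard definitions of sensitivity, specificity, overall accuracy and Matthews correlation coefficient (the function name `classification_metrics` is illustrative), reproduces the VS-RF row from its counts:

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return SE, SP and Q in percent, plus MCC, from confusion counts."""
    se = 100.0 * tp / (tp + fn)                   # sensitivity
    sp = 100.0 * tn / (tn + fp)                   # specificity
    q = 100.0 * (tp + tn) / (tp + fn + tn + fp)   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "Q": q, "MCC": mcc}


# Counts for VS-RF on the external prediction set, taken from the table.
vs_rf = classification_metrics(tp=27, fn=0, tn=23, fp=3)
print({name: round(value, 2) for name, value in vs_rf.items()})
# SE = 100.0, SP = 88.46, Q = 94.34, MCC = 0.89
```

Applying the same function to the counts in the other rows gives a quick consistency check on the tabulated percentages.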
Figure 2. The ROC (receiver operating characteristic) curves of VS-RF, SVM, GP, LDA and kNN for the prediction set.
Comparison of random forest (RF) statistical performance with and without variable selection on the respiratory syncytial virus (RSV) inhibitor dataset (a).
| Data set | Method | TP | FN | SE (%) | TN | FP | SP (%) | Q (%) | Qcv | |
|---|---|---|---|---|---|---|---|---|---|---|
| Training set | RF | 82 | 0 | 100 | 81 | 0 | 100 | 100 | 0.816 | 171.42 |
| Training set | VS-RF | 82 | 0 | 100 | 81 | 0 | 100 | 100 | 0.816 | 8.06 |
| Test set | RF | 25 | 2 | 92.59 | 23 | 3 | 88.46 | 90.57 | - | - |
| Test set | VS-RF | 27 | 0 | 100 | 23 | 3 | 88.46 | 94.34 | - | - |
(a) For RF, mtry = 62; for VS-RF, mtry = 4; TP, true positives; FN, false negatives; SE, sensitivity; TN, true negatives; FP, false positives; SP, specificity; Q, the overall prediction accuracy; MCC, Matthews correlation coefficient; F, F-measure; Qcv, the prediction accuracy from 10-fold cross-validation for the training set.
Representative compounds with their chemical names, activities and classes used in the dataset.
| No. | pEC50 | Class | Ref. |
|---|---|---|---|
| 1 | 4.507 | L | 12 |
| 2 | 6.328 | L | 12 |
| 3 | 5.174 | L | 12 |
| 4* | 6.222 | L | 12 |
| 5 | 5.959 | L | 12 |
| 7 | 5.959 | L | 12 |
| 8* | 4.81 | L | 12 |
| 9 | 5.481 | L | 12 |
| 10 | 5.114 | L | 12 |
| 11 | 5.570 | L | 12 |
| 12* | 6.284 | L | 12 |
| 29 | 6.125 | L | 13 |
| 30 | 8.398 | H | 13 |
| 31 | 7.959 | H | 13 |
| 32* | 7.796 | H | 13 |
| 34 | 7.602 | H | 13 |
| 35 | 7.745 | H | 13 |
| 36 | 7.921 | H | 13 |
| 37 | 7.678 | H | 13 |
| 38 | 8.046 | H | 13 |
| 39* | 8.000 | H | 13 |
| 41 | 7.959 | H | 13 |
| 42* | 7.854 | H | 13 |
| 43 | 7.824 | H | 13 |
*, test set; from the corresponding reference; H denotes highly active compounds, L denotes low-activity compounds.