Tiejun Cheng, Qingliang Li, Yanli Wang, Stephen H Bryant.
Abstract
Aqueous solubility is recognized as a critical parameter in both early- and late-stage drug discovery. Therefore, in silico modeling of solubility has attracted extensive interest in recent years. Most previous studies have been limited to relatively small data sets with limited diversity, which in turn limits the predictive power of the derived models. In this work, we present a support vector machine (SVM) model for the binary classification of solubility that takes advantage of the largest known public data set, which contains over 46 000 compounds with experimental solubility. Our model was optimized in combination with a reduction and recombination feature selection strategy. The best model demonstrated robust performance in both cross-validation and the prediction of two independent test sets, indicating that it could be a practical tool for selecting soluble compounds for screening, purchasing, and synthesis. Moreover, because it relies entirely on public resources, our work may be used for the comparative evaluation of solubility classification studies.
Year: 2011 PMID: 21214224 PMCID: PMC3047290 DOI: 10.1021/ci100364a
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956
Data Sets Used in This Study
| data set | type | total compounds | soluble compounds | insoluble compounds | soluble/insoluble ratio |
|---|---|---|---|---|---|
| I | training set | 41 501 | 28 921 | 12 580 | 2.30:1 |
| II | internal test set | 4510 | 3177 | 1333 | 2.38:1 |
| III | external test set | 32 | 25 | 7 | 3.57:1 |
Figure 1. Six additional physicochemical properties (ADD6) used in this study. The box plot shows the minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum of each property. MW: molecular weight; HBD: number of hydrogen-bond donors; HBA: number of hydrogen-bond acceptors; ROTB: number of rotatable bonds; CPLX: molecular complexity; and TPSA: topological polar surface area. The properties of the training set (data set I) are suffixed with I, while those of the two test sets (data sets II and III) are suffixed with II and III, respectively. Note that the statistics for all properties have been increased by one to fit the logarithmic axis, because the minimum values of some properties (e.g., HBD) are zero, and the logarithm of zero is undefined (it diverges to negative infinity).
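The shift-by-one step described in the Figure 1 caption can be illustrated with a short sketch. The function name `shifted_log_summary`, the toy HBD counts, and the choice of the "inclusive" quartile method are assumptions made for illustration; the source does not state how the quartiles were computed.

```python
from math import log10
from statistics import quantiles


def shifted_log_summary(values):
    """Five-number summary of (value + 1), expressed in log10 units.

    Adding one keeps zero-valued properties (e.g., HBD = 0) plottable,
    because log10(0) is undefined while log10(0 + 1) = 0.
    """
    shifted = [v + 1 for v in values]
    # Quartiles by linear interpolation; the method is an assumption.
    q1, q2, q3 = quantiles(shifted, n=4, method="inclusive")
    stats = (min(shifted), q1, q2, q3, max(shifted))
    return tuple(log10(s) for s in stats)


# Hypothetical hydrogen-bond donor counts, including zeros.
hbd_counts = [0, 0, 1, 2, 3, 9]
summary = shifted_log_summary(hbd_counts)
print(summary)
```

With these toy counts the minimum maps to log10(1) = 0 and the maximum to log10(10) = 1, so every statistic stays finite on the logarithmic axis.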
The 10-Fold Cross-Validation Using G-Mean as a Metric for SVM Models with Default Parameters
| fingerprint | SVM, linear kernel (%) | SVM, RBF kernel (%) |
|---|---|---|
| MACCS166 | 69.7 (53.3) | 70.9 (51.0) |
| PC881 | 76.6 (70.6) | 74.7 (63.2) |
| MACCS166 + ADD6 | | 72.8 |
| PC881 + ADD6 | | 74.8 |
| PC881 + MACCS166 | | 75.6 |
| PC881 + MACCS166 + ADD6 | | 75.7 |
| PC307 + MACCS90 + ADD5 | | 75.6 |
MACCS166: the MDL MACCS 166 keys; PC881: the PubChem fingerprint; ADD6: the six additional physicochemical properties described in Figure 1. PC307, MACCS90, and ADD5 are truncated versions of their parent fingerprints that retain only the component features with F-scores above 0.001. The trailing number indicates the length of the corresponding fingerprint.
The number inside the parentheses is generated by the SVM model in which data imbalance has not been considered.
Relevant metrics for SVM models with linear kernel have not been calculated for the last five fingerprints.
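The F-score cutoff mentioned in the footnotes above can be sketched as follows. This is a minimal illustration that assumes the F-score is the Fisher-style score commonly used for SVM feature ranking (between-class separation of the feature means divided by the sum of the within-class sample variances); the source does not give the formula, and the names `f_score` and `select_features` and the toy bit vectors are hypothetical.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """Fisher-style F-score of one feature (assumed definition).

    pos_values / neg_values: the feature's values for soluble and
    insoluble compounds, respectively (each needs at least two values).
    """
    overall_mean = mean(pos_values + neg_values)
    between = ((mean(pos_values) - overall_mean) ** 2
               + (mean(neg_values) - overall_mean) ** 2)
    # Sample variances (divisor n - 1) within each class.
    within = variance(pos_values) + variance(neg_values)
    if within == 0:
        # Constant within both classes: uninformative unless the means differ.
        return float("inf") if between > 0 else 0.0
    return between / within


def select_features(pos_rows, neg_rows, threshold=0.001):
    """Return indices of features whose F-score exceeds the threshold."""
    keep = []
    for j in range(len(pos_rows[0])):
        pos_col = [row[j] for row in pos_rows]
        neg_col = [row[j] for row in neg_rows]
        if f_score(pos_col, neg_col) > threshold:
            keep.append(j)
    return keep


# Toy fingerprints: feature 0 separates the classes, features 1 and 2 do not.
soluble = [[1, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]]
insoluble = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 1]]
print(select_features(soluble, insoluble))
```

Under this assumed definition, a bit that is set equally often in both classes, or set in every compound, scores zero and is dropped, which is the kind of reduction that shortens PC881 to PC307.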
Prediction of Independent Test Sets
Columns TP through FNR (%) refer to soluble compounds; columns TN through FPR (%) refer to insoluble compounds.
| data set | model | TP | FN | sensitivity (%) | FNR (%) | TN | FP | specificity (%) | FPR (%) | precision (%) | recall (%) | accuracy (%) | G-mean (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| II | SVM | 2622 | 555 | 82.5 | 17.5 | 1115 | 218 | 83.6 | 16.4 | 92.3 | 82.5 | 82.9 | 83.1 |
| | SVM | 2705 | 472 | 85.1 | 14.9 | 1084 | 249 | 81.3 | 18.7 | 91.6 | 85.1 | 84.0 | 83.2 |
| III | SVM | 22 | 3 | 88.0 | 12.0 | 2 | 5 | 28.6 | 71.4 | 81.5 | 88.0 | 75.0 | 50.1 |
| | SVM | 22 | 3 | 88.0 | 12.0 | 3 | 4 | 42.9 | 57.1 | 84.6 | 88.0 | 78.1 | 61.4 |
| | SVM | 19 | 3 | 86.4 | 13.6 | 2 | 4 | 33.3 | 66.7 | 82.6 | 86.4 | 75.0 | 53.6 |
| | ASM-ATC-LOGP | 24 | 1 | 96.0 | 4.0 | 3 | 4 | 42.9 | 57.1 | 85.7 | 96.0 | 84.4 | 64.1 |
| | MLR | 21 | 4 | 84.0 | 16.0 | 5 | 2 | 71.4 | 28.6 | 91.3 | 84.0 | 81.2 | 77.5 |
| | ANN | 24 | 1 | 96.0 | 4.0 | 4 | 3 | 57.1 | 42.9 | 88.9 | 96.0 | 87.5 | 74.1 |
| | category | 24 | 1 | 96.0 | 4.0 | 2 | 5 | 28.6 | 71.4 | 82.8 | 96.0 | 81.2 | 52.4 |
| | ChemSilico | 24 | 1 | 96.0 | 4.0 | 1 | 6 | 14.3 | 85.7 | 80.0 | 96.0 | 78.1 | 37.0 |
| | Optibrium | 24 | 1 | 96.0 | 4.0 | 3 | 4 | 42.9 | 57.1 | 85.7 | 96.0 | 84.4 | 64.1 |
| | Pharma Algorithms | 24 | 1 | 96.0 | 4.0 | 1 | 6 | 14.3 | 85.7 | 80.0 | 96.0 | 78.1 | 37.0 |
| | Simulations Plus | 22 | 3 | 88.0 | 12.0 | 3 | 4 | 42.9 | 57.1 | 84.6 | 88.0 | 78.1 | 61.4 |
| | original consensus | 23 | 2 | 92.0 | 8.0 | 2 | 5 | 28.6 | 71.4 | 82.1 | 92.0 | 78.1 | 51.3 |
| | SPARC | 15 | 10 | 60.0 | 40.0 | 6 | 1 | 85.7 | 14.3 | 93.7 | 60.0 | 65.6 | 71.7 |
TP: true positive; FN: false negative; and FNR: false negative rate = FN/(FN + TP).
TN: true negative; FP: false positive; and FPR: false positive rate = FP/(FP + TN).
Model is based on the selected feature set, i.e., PC307 + MACCS90 + ADD5.
Model is based on the complete feature set, i.e., PC881 + MACCS166 + ADD6.
Results are based on a clean version of data set III by removing the four common samples in data set I and III.
Data are cited from ref (16).
Data are cited from the Supporting Information of ref (55).
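The metrics in the table above all follow from the four confusion-matrix counts. The sketch below recomputes them for one row; FNR and FPR use the definitions given in the footnotes, while precision, accuracy, and G-mean (taken here as the geometric mean of sensitivity and specificity) use their standard definitions, which is an assumption that happens to reproduce the tabulated values. The function name `classification_metrics` is hypothetical.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Derive the table's metrics (as fractions) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on soluble compounds
    specificity = tn / (tn + fp)   # recall on insoluble compounds
    return {
        "sensitivity": sensitivity,
        "FNR": fn / (fn + tp),
        "specificity": specificity,
        "FPR": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": sensitivity,
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        # Assumed definition: geometric mean of the two class-wise recalls.
        "G-mean": sqrt(sensitivity * specificity),
    }


# Counts from the first SVM row for data set II.
metrics = classification_metrics(tp=2622, fn=555, tn=1115, fp=218)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}")
```

Because G-mean multiplies the two class-wise recalls, a model that labels nearly everything as soluble is penalized even when its accuracy looks acceptable, as the data set III rows with low specificity show.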
Top 10 Features That Contribute Most to Classification
An example fragment matching each SMARTS pattern is depicted in red.
Figure 2. Diversity analysis of data sets I−III and the 100 compounds from the training set of the solubility prediction challenge (SPC100) (34). (A) Chemical space defined by molecular weight and TPSA. Note that one data point (1139.8, 133) from data set II is not included in this figure. (B) Distribution of solubility in the chemical space defined by molecular weight and TPSA. Both panels use the same color scheme.