Mariusz Butkiewicz, Edward W Lowe, Ralf Mueller, Jeffrey L Mendenhall, Pedro L Teixeira, C David Weaver, Jens Meiler.
Abstract
With the rapidly increasing availability of High-Throughput Screening (HTS) data in the public domain, such as the PubChem database, methods for ligand-based computer-aided drug discovery (LB-CADD) have the potential to accelerate and reduce the cost of probe development and drug discovery efforts in academia. We assemble nine data sets from realistic HTS campaigns representing major families of drug target proteins for benchmarking LB-CADD methods. Each data set is in the public domain through PubChem and was carefully collated using confirmation screens that validate active compounds. These data sets provide the foundation for benchmarking a new cheminformatics framework, BCL::ChemInfo, which is freely available for non-commercial use. Quantitative structure-activity relationship (QSAR) models are built using Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Decision Trees (DTs), and Kohonen networks (KNs). Problem-specific descriptor optimization protocols are assessed, including Sequential Feature Forward Selection (SFFS) and various information content measures. Measures of predictive power and confidence are evaluated through cross-validation, and a consensus prediction scheme is tested that combines orthogonal machine learning algorithms into a single predictor. Enrichments ranging from 15 to 101 for a TPR cutoff of 25% are observed.
Year: 2013 PMID: 23299552 PMCID: PMC3759399 DOI: 10.3390/molecules18010735
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Overview of PubChem biological assays and data set statistics.
| Protein Target Class | Protein Target | PubChem Summary Assay ID (SAID) | Number of Actives | Number of Inactives | Hit Rate | Inactives-to-Actives Ratio |
|---|---|---|---|---|---|---|
| | Orexin1 Receptor | 435008 | 230 | 218,071 | 0.11% | 948 |
| | M1 Muscarinic Receptor | 1798 | 188 | 61,661 | 0.30% | 327 |
| | M1 Muscarinic Receptor | 435034 | 448 | 61,407 | 0.73% | 138 |
| | Potassium Ion Channel Kir2.1 | 1843 | 172 | 301,473 | 0.06% | 1,752 |
| | KCNQ2 potassium channel | 2258 | 213 | 302,351 | 0.07% | 1,419 |
| | Cav3 T-type Calcium Channels | 463087 | 703 | 100,210 | 0.70% | 143 |
| | Choline Transporter | 488997 | 252 | 302,246 | 0.08% | 1,199 |
| | Serine/Threonine Kinase 33 | 2689 | 172 | 319,821 | 0.05% | 1,859 |
| | Tyrosyl-DNA Phosphodiesterase | 485290 | 292 | 344,477 | 0.08% | 1,179 |
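The two derived columns of the table above follow from the active and inactive counts. The sketch below is a minimal illustration, assuming the hit rate is the share of actives among all screened compounds and the ratio is inactives divided by actives. The function names are hypothetical, and the rounding convention used for the table is not stated, so the last digit may differ for some rows.

```python
def hit_rate_percent(actives: int, inactives: int) -> float:
    """Actives as a percentage of all screened compounds (assumed definition)."""
    return 100.0 * actives / (actives + inactives)


def inactive_to_active_ratio(actives: int, inactives: int) -> float:
    """Number of inactive compounds per active compound."""
    return inactives / actives


# Counts for the Orexin1 Receptor data set (SAID 435008) from the table above.
actives, inactives = 230, 218_071
print(f"hit rate: {hit_rate_percent(actives, inactives):.2f}%")        # hit rate: 0.11%
print(f"ratio:    {inactive_to_active_ratio(actives, inactives):.0f}")  # ratio:    948
```

Ratios of this size mean a random pick is almost always inactive, which is why enrichment rather than plain accuracy is the more informative measure for these data sets.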
Descriptor selection results for Information Gain (IG), F-Score (FS), and Sequential Feature Forward Selection (SFFS) applied to each ML technique paired with each PubChem HTS data set. Results for the integral of the TNR-TPR curve are presented. The mean and standard deviation for each SAID (row) and each descriptor selection approach (column) are given.
| PubChem SAID | ANN IG | ANN FS | ANN SFFS | SVM IG | SVM FS | SVM SFFS | DT IG | DT FS | DT SFFS | KN IG | KN FS | KN SFFS | Mean (Stdev) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 435008 | 0.79 | 0.81 | 0.80 | 0.84 | 0.84 | 0.85 | 0.77 | 0.77 | 0.77 | 0.77 | 0.77 | 0.77 | 0.80 (0.03) |
| 1798 | 0.68 | 0.68 | 0.64 | 0.74 | 0.73 | 0.76 | 0.72 | 0.68 | 0.52 | 0.70 | 0.72 | 0.68 | 0.69 (0.06) |
| 435034 | 0.79 | 0.80 | 0.80 | 0.83 | 0.82 | 0.85 | 0.74 | 0.74 | 0.50 | 0.75 | 0.76 | 0.78 | 0.76 (0.09) |
| 2258 | 0.80 | 0.80 | 0.83 | 0.84 | 0.84 | 0.85 | 0.78 | 0.75 | 0.77 | 0.75 | 0.76 | 0.79 | 0.80 (0.04) |
| 1843 | 0.91 | 0.90 | 0.92 | 0.92 | 0.91 | 0.92 | 0.86 | 0.84 | 0.83 | 0.86 | 0.83 | 0.86 | 0.88 (0.04) |
| 463087 | 0.84 | 0.86 | 0.86 | 0.88 | 0.89 | 0.89 | 0.82 | 0.81 | 0.82 | 0.75 | 0.77 | 0.81 | 0.83 (0.05) |
| 488997 | 0.76 | 0.75 | 0.75 | 0.79 | 0.81 | 0.82 | 0.77 | 0.74 | 0.75 | 0.74 | 0.75 | 0.76 | 0.77 (0.03) |
| 2689 | 0.92 | 0.92 | 0.92 | 0.92 | 0.93 | 0.93 | 0.88 | 0.86 | 0.86 | 0.88 | 0.86 | 0.85 | 0.89 (0.03) |
| 485290 | 0.83 | 0.84 | 0.85 | 0.86 | 0.86 | 0.86 | 0.82 | 0.84 | 0.76 | 0.80 | 0.80 | 0.75 | 0.82 (0.04) |
| Mean (Stdev) | 0.81 (0.07) | 0.82 (0.07) | 0.82 (0.09) | 0.85 (0.06) | 0.85 (0.06) | 0.86 (0.05) | 0.79 (0.05) | 0.78 (0.06) | 0.73 (0.13) | 0.78 (0.06) | 0.78 (0.04) | 0.78 (0.05) | |
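The quality measure reported above is the integral of the TNR-TPR curve. The sketch below shows one way such an integral could be computed from predicted scores. It is an illustration under stated assumptions, not the BCL::ChemInfo implementation: it assumes TNR is integrated over TPR, that compounds are ranked by descending score, and that ties keep their input order. The function name `tnr_tpr_integral` and the `tpr_max` parameter are hypothetical. Under these assumptions the integral over the full TPR range equals the area under the ROC curve, and a perfect ranking restricted to a TPR range of 0 to 0.25 yields 0.25.

```python
def tnr_tpr_integral(scores, labels, tpr_max=1.0):
    """Integrate TNR over TPR from 0 to tpr_max (assumed definition).

    scores: predicted activity, higher means more likely active.
    labels: 1 for an active compound, 0 for an inactive one.
    """
    # Rank compounds from highest to lowest score (stable for ties).
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda p: -p[0])]
    n_act = sum(ranked)
    n_inact = len(ranked) - n_act

    tp = fp = 0
    area = 0.0
    tpr_prev, tnr_prev = 0.0, 1.0
    for is_active in ranked:
        if is_active:
            tp += 1
        else:
            fp += 1
        tpr = tp / n_act
        tnr = 1.0 - fp / n_inact
        if tpr > tpr_prev:
            # TPR advanced: TNR is constant along this step, so add a rectangle,
            # clipped at tpr_max.
            area += tnr_prev * (min(tpr, tpr_max) - tpr_prev)
            if tpr >= tpr_max:
                break
        tpr_prev, tnr_prev = tpr, tnr
    return area


# Toy example with two actives and two inactives (made-up scores).
scores = [0.9, 0.8, 0.3, 0.2]
print(tnr_tpr_integral(scores, [1, 1, 0, 0]))                # 1.0  (perfect ranking)
print(tnr_tpr_integral(scores, [1, 0, 1, 0]))                # 0.75
print(tnr_tpr_integral(scores, [1, 1, 0, 0], tpr_max=0.25))  # 0.25
```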
The two top-ranking (#r) consensus QSAR models are presented for each PubChem data set (SAID) and each descriptor selection technique: F-Score (FS), Information Gain (IG), and Sequential Feature Forward Selection (SFFS). Every model is evaluated by the integral of the TNR-TPR curve over the TPR range 0.00 to 0.25 (INT). The ranking is ordered by Enrichment (ENR). The best single predictor is shown for comparison. The ENR difference between the best consensus predictor and the best single predictor, normalized by the inactives-to-actives ratio, is given (Diff).
| SAID | FS | #r | INT | ENR | IG | #r | INT | ENR | SFFS | #r | INT | ENR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 435008 | ANN DT KN SVM | 1 | 0.249 | 27 | SVM | 1 | 0.245 | 24 | DT SVM | 1 | 0.247 | 40 |
| | SVM | 2 | 0.245 | 26 | DT SVM | 2 | 0.245 | 22 | SVM | 2 | 0.246 | 39 |
| | DT SVM | 3 | 0.245 | 26 | KN SVM | 3 | 0.245 | 22 | ANN DT SVM | 3 | 0.249 | 27 |
| | Diff | - | - | 0.11% | Diff | - | - | 0.21% | Diff | - | - | 0.11% |
| 1798 | ANN DT KN SVM | 1 | 0.240 | 10 | ANN DT KN SVM | 1 | 0.241 | 15 | ANN DT KN SVM | 1 | 0.243 | 7 |
| | ANN DT SVM | 2 | 0.240 | 9 | ANN KN SVM | 2 | 0.241 | 10 | ANN KN SVM | 2 | 0.243 | 7 |
| | SVM | 6 | 0.240 | 8 | SVM | 8 | 0.242 | 7 | SVM | 8 | 0.244 | 5 |
| | Diff | - | - | 0.61% | Diff | - | - | 2.45% | Diff | - | - | 0.61% |
| 435034 | ANN SVM | 1 | 0.246 | 18 | ANN SVM | 1 | 0.245 | 17 | ANN SVM | 1 | 0.246 | 18 |
| | ANN DT SVM | 2 | 0.246 | 17 | ANN DT SVM | 2 | 0.245 | 16 | ANN DT SVM | 2 | 0.246 | 18 |
| | SVM | 7 | 0.245 | 14 | SVM | 4 | 0.246 | 16 | SVM | 3 | 0.246 | 17 |
| | Diff | - | - | 2.90% | Diff | - | - | 0.72% | Diff | - | - | 0.72% |
| 1843 | ANN DT KN | 1 | 0.249 | 54 | ANN DT SVM | 1 | 0.250 | 68 | ANN DT KN SVM | 1 | 0.250 | 50 |
| | DT KN SVM | 2 | 0.249 | 45 | ANN DT KN SVM | 2 | 0.250 | 67 | ANN DT SVM | 2 | 0.250 | 45 |
| | SVM | 11 | 0.249 | 32 | SVM | 10 | 0.250 | 45 | ANN | 11 | 0.250 | 30 |
| | Diff | - | - | 1.26% | Diff | - | - | 1.31% | Diff | - | - | 1.14% |
| 2258 | ANN DT | 1 | 0.241 | 28 | ANN DT | 1 | 0.241 | 34 | ANN DT KN | 1 | 0.244 | 65 |
| | ANN DT SVM | 2 | 0.246 | 26 | ANN DT KN SVM | 2 | 0.246 | 33 | DT KN SVM | 2 | 0.249 | 49 |
| | SVM | 8 | 0.246 | 18 | DT | 6 | 0.238 | 23 | SVM | 10 | 0.249 | 28 |
| | Diff | - | - | 0.70% | Diff | - | - | 0.78% | Diff | - | - | 2.61% |
| 463087 | ANN SVM | 1 | 0.250 | 23 | ANN DT KN SVM | 1 | 0.250 | 19 | ANN KN SVM | 1 | 0.250 | 29 |
| | SVM | 2 | 0.250 | 22 | ANN DT SVM | 2 | 0.250 | 18 | ANN DT KN SVM | 2 | 0.250 | 28 |
| | DT SVM | 4 | 0.250 | 21 | SVM | 8 | 0.250 | 17 | SVM | 8 | 0.250 | 24 |
| | Diff | - | - | 0.70% | Diff | - | - | 1.40% | Diff | - | - | 3.50% |
| 2689 | ANN DT KN SVM | 1 | 0.248 | 74 | ANN DT KN SVM | 1 | 0.248 | 58 | ANN DT SVM | 1 | 0.249 | 101 |
| | ANN DT SVM | 2 | 0.248 | 63 | ANN DT SVM | 2 | 0.248 | 54 | ANN DT KN SVM | 2 | 0.249 | 91 |
| | ANN | 10 | 0.250 | 42 | SVM | 10 | 0.248 | 41 | ANN | 10 | 0.248 | 44 |
| | Diff | - | - | 2.67% | Diff | - | - | 1.42% | Diff | - | - | 4.75% |
| 488997 | ANN DT SVM | 1 | 0.246 | 20 | DT KN SVM | 1 | 0.244 | 14 | ANN DT KN SVM | 1 | 0.243 | 49 |
| | ANN DT KN SVM | 2 | 0.247 | 19 | ANN DT KN | 2 | 0.241 | 13 | ANN KN SVM | 2 | 0.243 | 44 |
| | SVM | 6 | 0.245 | 15 | DT | 7 | 0.242 | 12 | SVM | 11 | 0.244 | 31 |
| | Diff | - | - | 0.27% | Diff | - | - | 0.11% | Diff | - | - | 0.97% |
| 485290 | ANN DT KN | 1 | 0.241 | 64 | DT KN SVM | 1 | 0.245 | 71 | ANN SVM | 1 | 0.244 | 30 |
| | DT KN SVM | 2 | 0.245 | 58 | ANN DT KN SVM | 2 | 0.246 | 60 | ANN DT SVM | 2 | 0.244 | 28 |
| | SVM | 11 | 0.244 | 38 | SVM | 12 | 0.245 | 36 | SVM | 4 | 0.244 | 26 |
| | Diff | - | - | 2.22% | Diff | - | - | 2.96% | Diff | - | - | 0.28% |
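The Diff rows can be read as simple arithmetic on the ENR values. The sketch below is one reading of the caption, assuming Diff is the ENR of the best consensus model minus the ENR of the best single predictor, divided by the inactives-to-actives ratio from the data set overview table and expressed as a percentage. The function name is hypothetical. The two worked entries use cases where the consensus model ranks first.

```python
def enrichment_diff_percent(enr_consensus: float, enr_single: float,
                            inactives_to_actives: float) -> float:
    """ENR gain of the consensus model, normalized by the inactives-to-actives ratio."""
    return 100.0 * (enr_consensus - enr_single) / inactives_to_actives


# SAID 435008, FS: consensus ENR 27, best single predictor (SVM) ENR 26, ratio 948.
print(f"{enrichment_diff_percent(27, 26, 948):.2f}%")  # 0.11%

# SAID 1798, IG: consensus ENR 15, best single predictor (SVM) ENR 7, ratio 327.
print(f"{enrichment_diff_percent(15, 7, 327):.2f}%")   # 2.45%
```

Normalizing by the ratio puts the gain on a scale comparable across data sets, since the attainable enrichment grows with the imbalance between inactives and actives.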