Alexey V Zakharov, Megan L Peach, Markus Sitzmann, Marc C Nicklaus.
Abstract
Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds contrasting with a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to test and develop strategies to efficiently build robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and "biological" descriptors] were used to generate a variety of QSAR models in the program GUSAR. The models obtained were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models in the publicly available NCI/CADD Group Web services (http://cactus.nci.nih.gov/chemical/apps/cap).
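To make the idea of under-sampling concrete, the following is a minimal sketch of plain random under-sampling to a fixed active:inactive ratio. It is a generic illustration and not the procedure implemented in GUSAR or used by the authors; the function name `undersample`, the `ratio` and `seed` parameters, and the toy compound labels are all assumptions introduced here.

```python
import random


def undersample(actives, inactives, ratio=3, seed=0):
    """Keep every active and a random subset of inactives.

    Hypothetical helper: the number of inactives kept is ratio * len(actives),
    capped at the number of inactives available.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_keep = min(len(inactives), ratio * len(actives))
    return list(actives), rng.sample(list(inactives), n_keep)


# Toy data with a 1:75 imbalance (labels are placeholders, not real compounds).
actives = [f"active_{i}" for i in range(10)]
inactives = [f"inactive_{i}" for i in range(750)]

kept_actives, kept_inactives = undersample(actives, inactives, ratio=3)
print(len(kept_actives), len(kept_inactives))  # 10 30
```

Repeating the draw with different seeds and training one model per subset would be one simple way to use more of the inactive compounds, though whether that matches any of the approaches compared in the tables is not stated in this record.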
Year: 2014 PMID: 24524735 PMCID: PMC3985743 DOI: 10.1021/ci400737s
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956
Characteristics of PubChem HTS Assays Used for QSAR Modeling
| AID | assay name | initial number of compounds | compounds after preprocessing | active | inactive | ratio (active:inactive) |
|---|---|---|---|---|---|---|
| 504466 | genotoxicity inductors in HEK293T cells | 330,115 | 310,403 | 4108 | 306,295 | 1:75 |
| 485314 | DNA polymerase beta inhibitors | 334,467 | 306,830 | 4348 | 302,482 | 1:70 |
| 485341 | AmpC beta-lactamase inhibitors | 330,683 | 285,970 | 1694 | 284,276 | 1:168 |
| 624202 | BRCA1 activators | 376,014 | 351,201 | 3902 | 347,299 | 1:89 |
| 651820 | hepatitis C virus inhibitors | 339,561 | 268,119 | 10,727 | 257,392 | 1:24 |
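The ratio column can be reproduced from the active and inactive counts in the table above. The short sketch below does that arithmetic; the helper name `imbalance_ratio` is an assumption introduced for illustration, and rounding to the nearest integer is inferred from the tabulated values rather than stated in the source.

```python
def imbalance_ratio(active, inactive):
    """Return the number of inactives per active, rounded to the nearest integer."""
    return round(inactive / active)


# (active, inactive) counts copied from the table, keyed by PubChem AID.
assays = {
    504466: (4108, 306295),
    485314: (4348, 302482),
    485341: (1694, 284276),
    624202: (3902, 347299),
    651820: (10727, 257392),
}

for aid, (active, inactive) in assays.items():
    print(f"AID {aid}: 1:{imbalance_ratio(active, inactive)}")
```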
Test Set Prediction Quality Parameters for Each Imbalanced Learning Approach
| | multiple under-sampling | | threshold, ratio 1:3 under-sampling | | one-sided under-sampling | | similarity under-sampling | | cluster under-sampling | | diversity under-sampling | | only threshold selection | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| assay AID | BA | GM | BA | GM | BA | GM | BA | GM | BA | GM | BA | GM | BA | GM |
| 504466 | 0.84 | 0.84 | 0.83 | 0.83 | 0.80 | 0.80 | 0.76 | 0.76 | 0.82 | 0.82 | 0.72 | 0.72 | 0.77 | 0.75 |
| 485314 | 0.78 | 0.78 | 0.78 | 0.78 | 0.76 | 0.76 | 0.67 | 0.67 | 0.69 | 0.68 | 0.68 | 0.65 | 0.72 | 0.69 |
| 485341 | 0.70 | 0.70 | 0.69 | 0.69 | 0.66 | 0.66 | 0.52 | 0.52 | 0.61 | 0.58 | 0.58 | 0.51 | 0.50 | 0.06 |
| 624202 | 0.80 | 0.80 | 0.79 | 0.79 | 0.76 | 0.76 | 0.67 | 0.67 | 0.76 | 0.75 | 0.72 | 0.69 | 0.60 | 0.47 |
| 651820 | 0.78 | 0.78 | 0.77 | 0.77 | 0.75 | 0.75 | 0.70 | 0.70 | 0.71 | 0.70 | 0.69 | 0.68 | 0.75 | 0.74 |
| average | 0.78 | 0.78 | 0.77 | 0.77 | 0.74 | 0.74 | 0.66 | 0.66 | 0.72 | 0.71 | 0.68 | 0.65 | 0.67 | 0.54 |
BA: balanced accuracy. GM: geometric mean (G-mean).
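For reference, both metrics can be computed from confusion-matrix counts using their standard definitions: balanced accuracy is the mean of sensitivity and specificity, and G-mean is the square root of their product. The sketch below uses hypothetical counts (not taken from the paper) chosen to show how a model that labels almost everything inactive can score a balanced accuracy near 0.50 while its G-mean collapses toward zero, the pattern seen for AID 485341 under "only threshold selection".

```python
import math


def sensitivity(tp, fn):
    """Fraction of true actives predicted active."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true inactives predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Arithmetic mean of sensitivity and specificity."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))


# Hypothetical counts: 4 of 1000 actives found, 9960 of 10000 inactives rejected.
tp, fn, tn, fp = 4, 996, 9960, 40
ba = balanced_accuracy(tp, fn, tn, fp)
gm = g_mean(tp, fn, tn, fp)
print(f"BA = {ba:.2f}, GM = {gm:.2f}")  # BA = 0.50, GM = 0.06
```

Because G-mean multiplies the two rates, it penalizes a model that ignores the minority class far more strongly than balanced accuracy does, which is why the two columns diverge most for the weakest approaches.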
Test Set Prediction Quality Parameters for Combined Imbalanced Learning Approaches
| | combination of hybrid and cluster under-sampling | | hybrid under-sampling | | cluster under-sampling | |
|---|---|---|---|---|---|---|
| assay AID | BA | GM | BA | GM | BA | GM |
| 504466 | 0.85 | 0.85 | 0.83 | 0.83 | 0.82 | 0.82 |
| 485314 | 0.74 | 0.73 | 0.78 | 0.78 | 0.69 | 0.68 |
| 485341 | 0.67 | 0.66 | 0.69 | 0.69 | 0.61 | 0.58 |
| 624202 | 0.80 | 0.80 | 0.79 | 0.79 | 0.76 | 0.75 |
| 651820 | 0.75 | 0.75 | 0.77 | 0.77 | 0.71 | 0.70 |
| average | 0.76 | 0.76 | 0.77 | 0.77 | 0.72 | 0.71 |
BA: balanced accuracy. GM: geometric mean (G-mean).
External Validation of Imbalanced Methods
| | combination of hybrid and cluster under-sampling | | hybrid under-sampling | | multiple under-sampling | |
|---|---|---|---|---|---|---|
| assay AID | BA | GM | BA | GM | BA | GM |
| 651632 | 0.68 (AD: 100%) | 0.62 (AD: 100%) | 0.68 (AD: 100%) | 0.65 (AD: 100%) | 0.68 (AD: 100%) | 0.66 (AD: 100%) |
| 651632 (combined AD) | 0.68 (AD: 95.2%) | 0.63 (AD: 95.2%) | 0.69 (AD: 79.3%) | 0.66 (AD: 79.3%) | 0.69 (AD: 92.2%) | 0.67 (AD: 92.2%) |
AD: applicability domain. BA: balanced accuracy. GM: geometric mean (G-mean). Lower row: predictions limited to those that fall in the combined applicability domains of the hybrid and cluster combination methods.