Ruifeng Liu, Mohamed Diwan M AbdulHameed, Anders Wallqvist.
Abstract
High-throughput screening (HTS) is an important component of lead discovery, with virtual screening playing an increasingly important role. Both methods typically suffer from a lack of sensitivity and specificity against their true biological targets. With ever-increasing screening libraries and virtual compound collections, it is now feasible to conduct follow-up experimental testing on only a small fraction of hits. In this context, advances in virtual screening that achieve enrichment of true actives among top-ranked compounds ("early recognition") and, hence, reduce the number of hits to test, are highly desirable. The standard ligand-based virtual screening method for large compound libraries uses a molecular similarity search that ranks the likelihood of a compound being active against a drug target by its highest Tanimoto similarity to known active compounds. This approach assumes that the distributions of Tanimoto similarity values to all active compounds are identical (i.e., same mean and standard deviation), an assumption shown to be invalid (Baldi and Nasr, 2010). Here, we introduce two methods that improve early recognition of actives by exploiting similarity information of all molecules. The first method ranks a compound by its highest z-score instead of its highest Tanimoto similarity, and the second by an aggregated score calculated from its Tanimoto similarity values to all known actives and inactives (or a large number of structurally diverse molecules when information on inactives is unavailable). Our evaluations, which use datasets of over 20 HTS campaigns downloaded from PubChem, indicate that, compared to the conventional approach, both methods achieve a ~10% higher Boltzmann-enhanced discrimination of receiver operating characteristic (BEDROC) score, a metric of early recognition. Given the increasing use of virtual screening in early lead discovery, these methods provide straightforward means to enhance early recognition.
Keywords: BEDROC; ROCS; Tanimoto similarity; early recognition; lead discovery; virtual screening; z-score
Year: 2019 PMID: 31709231 PMCID: PMC6819673 DOI: 10.3389/fchem.2019.00701
Source DB: PubMed Journal: Front Chem ISSN: 2296-2646 Impact factor: 5.221
Figure 1. Examples of means and standard deviations (STD) of the Tanimoto coefficients (TCs) of 10,000 compounds randomly selected from the National Cancer Institute's virtual screening library, calculated with respect to three drugs approved by the Food and Drug Administration.
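The difference between the conventional ranking and the z-score ranking described in the abstract can be illustrated with a minimal Python sketch. It assumes that fingerprints are represented as sets of on-bit indices and that a compound's z-score for a given query active is its Tanimoto coefficient standardized by the mean and standard deviation of all library compounds' coefficients to that same query. The abstract does not spell out these details, so the function names and the toy data below are hypothetical.

```python
from statistics import mean, pstdev


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_sim_scores(library, queries):
    """Conventional score: highest Tanimoto coefficient to any query active."""
    return [max(tanimoto(c, q) for q in queries) for c in library]


def max_z_scores(library, queries):
    """Assumed maxZ score: highest per-query z-score of the Tanimoto coefficient."""
    best = [float("-inf")] * len(library)
    for q in queries:
        tcs = [tanimoto(c, q) for c in library]
        mu, sigma = mean(tcs), pstdev(tcs)
        if sigma == 0.0:
            continue  # this query does not discriminate between library compounds
        for i, tc in enumerate(tcs):
            best[i] = max(best[i], (tc - mu) / sigma)
    return best


def rank_by(scores):
    """Library indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: four library compounds and two query actives.
library = [{1, 2, 3, 4}, {1, 2, 5}, {7, 8, 9}, {2, 3, 4, 6}]
queries = [{1, 2, 3}, {7, 8}]

max_sim_order = rank_by(max_sim_scores(library, queries))
max_z_order = rank_by(max_z_scores(library, queries))
```

In this toy library, the third compound is the only one that resembles the second query. Its raw Tanimoto coefficient is lower than that of the first compound, but it stands out strongly against the similarity distribution of its query, so the z-score ranking places it first while the maximum-similarity ranking places it second.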
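PubChem datasets used in this study to evaluate performance of similarity search methods.
| Dataset | PubChem assay ID | Samples screened | Unique parent molecules | Unique active parent molecules |
|---|---|---|---|---|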
| AHR | 743122 | 8169 | 6318 | 723 |
| AR | 743040 | 9362 | 7009 | 290 |
| ARE | 743219 | 7167 | 5643 | 889 |
| AR-lbd | 743053 | 8599 | 6524 | 233 |
| Aromatase | 743139 | 7226 | 5601 | 273 |
| ATAD5 | 720516 | 9091 | 6825 | 253 |
| ER | 743079 | 7697 | 5993 | 716 |
| ER-lbd | 743077 | 8753 | 6727 | 324 |
| HSE | 743228 | 8150 | 6253 | 337 |
| MMP | 720637 | 7320 | 5625 | 885 |
| P53 | 720552 | 8634 | 6544 | 410 |
| PPARg | 743140 | 8184 | 6232 | 174 |
| 4-MU | 589 | 59070 | 58199 | 6146 |
| ALDH1A1 | 1030 | 220365 | 215450 | 15847 |
| BRCA1 | 624202 | 377534 | 373883 | 3938 |
| DNApb | 485314 | 337903 | 333082 | 4466 |
| ERK | 1454 | 133383 | 130623 | 532 |
| GCN5L2 | 504327 | 387577 | 379179 | 741 |
| hERG | 588834 | 5363 | 4568 | 553 |
| Lucif | 411 | 72335 | 70939 | 1558 |
| MiRNAs | 2289 | 336623 | 332205 | 3265 |
| Mitoch | 485298 | 322909 | 320471 | 734 |
| NPC1 | 485313 | 321376 | 319001 | 7532 |
| PR901 | 1347036 | 9523 | 7177 | 111 |
All datasets are derived from quantitative high throughput screening conducted at the National Center for Advancing Translational Sciences to ascertain chemical activities against different molecular targets. The first 12 datasets were used in the 2014 Tox21 Data Challenge.
The datasets can be accessed from the PubChem website using the assay IDs as queries.
Total number of samples screened in each dataset.
Number of structurally unique parent molecules (non-salts, non-mixtures) derived from retaining the largest chemical structure in each sample and performing structure standardization.
Number of structurally unique active parent molecules.
Dataset names: AHR, activators of aryl hydrocarbon receptor; AR, activators of androgen receptor; AR-lbd, activators of androgen receptor ligand binding domain; Aromatase, aromatase inhibitors; ER, estrogen receptor activators; ER-lbd, activators of estrogen receptor ligand binding domain; PPARg, activators of peroxisome proliferator-activated receptor gamma; ARE, activators of antioxidant response element; ATAD5, ATPase family AAA domain-containing protein 5; HSE, activators of heat shock response signaling pathway; MMP, disruptors of mitochondrial membrane potential; p53, activators of p53 signaling pathway; hERG, blockers of hERG potassium channel; PR901, agonists of progesterone receptor; 4-MU, spectroscopic response at the 4-methylumbelliferone region as a counter assay for fluorescence detection; Lucif, inhibitors of Luciferase; ERK, inhibitors of mitogen-activated protein kinase 1; ALDH1A1, inhibitors of aldehyde dehydrogenase 1 family, member A1; NPC1, promoters of Niemann-Pick C1 protein precursor; Mitoch, inhibitors of mitochondrial division; MiRNAs, modulators of miRNAs; DNApb, inhibitors of DNA polymerase beta; BRCA1, activators of BRCA1 expression; GCN5L2, inhibitors of histone acetyltransferase KAT2A.
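Mean and standard deviation of ROC_AUC and BEDROC values derived from a similarity search using the rank by maximum similarity (Max-Sim) and maximum z-score (maxZ) approaches over 10 runs, each with 100 randomly selected actives as queries.
| Dataset | ROC_AUC Max-Sim mean | ROC_AUC Max-Sim SD | ROC_AUC maxZ mean | ROC_AUC maxZ SD | ROC_AUC % difference | BEDROC Max-Sim mean | BEDROC Max-Sim SD | BEDROC maxZ mean | BEDROC maxZ SD | BEDROC % difference |
|---|---|---|---|---|---|---|---|---|---|---|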
| AHR | 0.754 | 0.011 | 0.759 | 0.012 | 0.7 | 0.402 | 0.016 | 0.365 | 0.019 | −9.3 |
| AR | 0.740 | 0.009 | 0.748 | 0.009 | 1.0 | 0.482 | 0.019 | 0.509 | 0.014 | 5.6 |
| ARE | 0.539 | 0.008 | 0.544 | 0.009 | 1.0 | 0.225 | 0.014 | 0.266 | 0.018 | 17.9 |
| AR-lbd | 0.815 | 0.011 | 0.811 | 0.011 | −0.5 | 0.607 | 0.023 | 0.610 | 0.028 | 0.6 |
| Aromatase | 0.662 | 0.017 | 0.686 | 0.016 | 3.6 | 0.263 | 0.026 | 0.291 | 0.019 | 10.7 |
| ATAD5 | 0.713 | 0.022 | 0.724 | 0.018 | 1.6 | 0.303 | 0.025 | 0.313 | 0.020 | 3.4 |
| ER | 0.665 | 0.005 | 0.667 | 0.009 | 0.4 | 0.380 | 0.009 | 0.395 | 0.012 | 3.8 |
| ER-lbd | 0.715 | 0.011 | 0.726 | 0.010 | 1.5 | 0.381 | 0.026 | 0.382 | 0.025 | 0.0 |
| HSE | 0.579 | 0.027 | 0.595 | 0.027 | 2.8 | 0.186 | 0.020 | 0.205 | 0.023 | 10.3 |
| MMP | 0.694 | 0.009 | 0.704 | 0.018 | 1.4 | 0.414 | 0.024 | 0.469 | 0.025 | 13.2 |
| P53 | 0.611 | 0.013 | 0.649 | 0.013 | 6.1 | 0.251 | 0.016 | 0.265 | 0.012 | 5.8 |
| PPARg | 0.681 | 0.018 | 0.686 | 0.027 | 0.7 | 0.274 | 0.031 | 0.277 | 0.030 | 1.0 |
| 4-MU | 0.565 | 0.007 | 0.604 | 0.007 | 7.0 | 0.272 | 0.010 | 0.316 | 0.009 | 16.1 |
| ALDH1A1 | 0.506 | 0.004 | 0.513 | 0.009 | 1.3 | 0.104 | 0.007 | 0.111 | 0.007 | 6.8 |
| BRCA1 | 0.667 | 0.006 | 0.694 | 0.004 | 4.2 | 0.147 | 0.006 | 0.155 | 0.006 | 5.5 |
| DNApb | 0.591 | 0.011 | 0.633 | 0.012 | 7.1 | 0.137 | 0.006 | 0.163 | 0.012 | 18.9 |
| ERK | 0.647 | 0.021 | 0.698 | 0.017 | 8.0 | 0.235 | 0.017 | 0.250 | 0.016 | 6.3 |
| GCN5L2 | 0.541 | 0.014 | 0.651 | 0.016 | 20.5 | 0.179 | 0.015 | 0.245 | 0.015 | 36.9 |
| hERG | 0.745 | 0.009 | 0.732 | 0.013 | −1.8 | 0.460 | 0.009 | 0.447 | 0.012 | −2.7 |
| Lucif | 0.707 | 0.008 | 0.737 | 0.010 | 4.1 | 0.255 | 0.010 | 0.268 | 0.012 | 4.9 |
| MiRNAs | 0.574 | 0.004 | 0.609 | 0.010 | 6.1 | 0.128 | 0.004 | 0.144 | 0.008 | 12.0 |
| Mitoch | 0.510 | 0.006 | 0.546 | 0.007 | 7.0 | 0.079 | 0.007 | 0.101 | 0.006 | 28.9 |
| NPC1 | 0.653 | 0.007 | 0.696 | 0.007 | 6.5 | 0.079 | 0.007 | 0.213 | 0.006 | 170.0 |
| PR901 | 0.931 | 0.013 | 0.945 | 0.013 | 1.5 | 0.800 | 0.025 | 0.822 | 0.022 | 2.8 |
Percent difference between mean ROC_AUC values for Max-Sim and maxZ methods.
Percent difference between mean BEDROC values for Max-Sim and maxZ methods.
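For readers unfamiliar with the BEDROC metric reported above, the following is a minimal Python sketch of its standard formulation (robust initial enhancement rescaled to the range 0 to 1). The input is a list of 0/1 activity labels ordered from the best-scored compound to the worst. The value of the weighting parameter `alpha` used in the study is not stated in this record; the default of 20 below is an assumption, as are the function name and the example lists.

```python
import math


def bedroc(ranked_labels, alpha=20.0):
    """BEDROC for 0/1 activity labels ordered from best-scored to worst-scored.

    alpha controls how strongly early ranks are weighted; 20.0 is an assumed
    default, not a value taken from the study.
    """
    n_total = len(ranked_labels)
    n_actives = sum(1 for label in ranked_labels if label)
    if n_actives == 0 or n_actives == n_total:
        raise ValueError("need at least one active and one inactive compound")
    ratio = n_actives / n_total

    # Exponentially weighted sum over the (1-based) ranks of the actives.
    weighted_sum = sum(
        math.exp(-alpha * rank / n_total)
        for rank, label in enumerate(ranked_labels, start=1)
        if label
    )
    # Expected value of the same sum when actives are uniformly distributed.
    random_sum = ratio * (1.0 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1.0)
    rie = weighted_sum / random_sum

    # Best and worst achievable RIE for this active ratio, used for rescaling.
    rie_max = (1.0 - math.exp(-alpha * ratio)) / (ratio * (1.0 - math.exp(-alpha)))
    rie_min = (1.0 - math.exp(alpha * ratio)) / (ratio * (1.0 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


# Hypothetical rankings of ten compounds containing two actives.
actives_first = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
actives_last = actives_first[::-1]
actives_mixed = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
```

Because of the exponential weighting, moving an active a few positions near the top of the list changes the score far more than the same move near the bottom, which is why BEDROC differences between methods can be much larger than the corresponding ROC_AUC differences.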
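Mean and standard deviation of ROC_AUC and BEDROC values derived from a ROCS-based 3-D molecular similarity search using the rank by maximum similarity (Max-Sim) and maximum z-score (maxZ) methods.
| Dataset | ROC_AUC Max-Sim mean | ROC_AUC Max-Sim SD | ROC_AUC maxZ mean | ROC_AUC maxZ SD | ROC_AUC % difference | BEDROC Max-Sim mean | BEDROC Max-Sim SD | BEDROC maxZ mean | BEDROC maxZ SD | BEDROC % difference |
|---|---|---|---|---|---|---|---|---|---|---|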
| AHR | 0.588 | 0.004 | 0.603 | 0.004 | 2.5 | 0.112 | 0.003 | 0.126 | 0.003 | 12.7 |
| AR | 0.729 | 0.007 | 0.749 | 0.015 | 2.7 | 0.201 | 0.010 | 0.234 | 0.013 | 16.2 |
| ARE | 0.547 | 0.004 | 0.560 | 0.005 | 2.5 | 0.090 | 0.003 | 0.101 | 0.002 | 12.6 |
| AR-lbd | 0.645 | 0.020 | 0.661 | 0.023 | 2.4 | 0.168 | 0.014 | 0.208 | 0.012 | 23.3 |
| Aromatase | 0.562 | 0.011 | 0.579 | 0.013 | 3.1 | 0.068 | 0.005 | 0.080 | 0.004 | 17.2 |
| ATAD5 | 0.663 | 0.010 | 0.686 | 0.012 | 3.5 | 0.164 | 0.009 | 0.183 | 0.009 | 11.5 |
| ER | 0.685 | 0.007 | 0.703 | 0.009 | 2.6 | 0.168 | 0.007 | 0.184 | 0.005 | 10.1 |
| ER-lbd | 0.679 | 0.007 | 0.697 | 0.008 | 2.8 | 0.181 | 0.006 | 0.201 | 0.006 | 11.0 |
| HSE | 0.526 | 0.011 | 0.538 | 0.012 | 2.3 | 0.078 | 0.005 | 0.085 | 0.005 | 8.7 |
| MMP | 0.553 | 0.002 | 0.568 | 0.003 | 2.6 | 0.119 | 0.003 | 0.127 | 0.001 | 7.0 |
| P53 | 0.575 | 0.009 | 0.590 | 0.011 | 2.5 | 0.091 | 0.006 | 0.101 | 0.004 | 11.2 |
| PPARg | 0.569 | 0.016 | 0.592 | 0.020 | 3.9 | 0.104 | 0.010 | 0.114 | 0.008 | 9.7 |
Percent difference between mean ROC_AUC values for Max-Sim and maxZ methods.
Percent difference between mean BEDROC values for Max-Sim and maxZ methods.
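Mean and standard deviation of ROC_AUC and BEDROC values derived from a fingerprint-based similarity search using the rank by maximum similarity (Max-Sim) and rank by aggregated score (AS) methods.
| Dataset | ROC_AUC Max-Sim mean | ROC_AUC Max-Sim SD | ROC_AUC AS mean | ROC_AUC AS SD | ROC_AUC % difference | BEDROC Max-Sim mean | BEDROC Max-Sim SD | BEDROC AS mean | BEDROC AS SD | BEDROC % difference |
|---|---|---|---|---|---|---|---|---|---|---|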
| AHR | 0.734 | 0.010 | 0.829 | 0.010 | 13.0 | 0.385 | 0.016 | 0.534 | 0.018 | 38.6 |
| AR | 0.749 | 0.013 | 0.809 | 0.013 | 8.0 | 0.443 | 0.020 | 0.618 | 0.019 | 39.7 |
| ARE | 0.537 | 0.008 | 0.650 | 0.011 | 21.0 | 0.264 | 0.021 | 0.404 | 0.015 | 53.2 |
| AR-lbd | 0.795 | 0.017 | 0.823 | 0.014 | 3.5 | 0.574 | 0.026 | 0.605 | 0.022 | 5.5 |
| Aromatase | 0.615 | 0.016 | 0.727 | 0.021 | 18.2 | 0.219 | 0.029 | 0.323 | 0.019 | 47.3 |
| ATAD5 | 0.652 | 0.018 | 0.717 | 0.018 | 9.9 | 0.273 | 0.023 | 0.303 | 0.024 | 11.0 |
| ER | 0.602 | 0.024 | 0.683 | 0.024 | 13.6 | 0.234 | 0.016 | 0.439 | 0.039 | 87.7 |
| ER-lbd | 0.689 | 0.014 | 0.713 | 0.011 | 3.6 | 0.374 | 0.027 | 0.386 | 0.017 | 3.3 |
| HSE | 0.557 | 0.009 | 0.650 | 0.012 | 16.8 | 0.122 | 0.010 | 0.210 | 0.024 | 71.8 |
| MMP | 0.663 | 0.011 | 0.760 | 0.007 | 14.6 | 0.398 | 0.030 | 0.565 | 0.026 | 41.9 |
| P53 | 0.588 | 0.020 | 0.723 | 0.013 | 23.0 | 0.224 | 0.017 | 0.262 | 0.020 | 16.9 |
| PPARg | 0.678 | 0.029 | 0.744 | 0.025 | 9.7 | 0.262 | 0.036 | 0.279 | 0.036 | 6.3 |
Percent difference between mean ROC_AUC values for Max-Sim and AS methods.
Percent difference between mean BEDROC values for Max-Sim and AS methods.
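Mean and standard deviation of ROC_AUC and BEDROC values derived from a fingerprint-based similarity search using the rank by maximum similarity (Max-Sim) and rank by aggregated score (AS) methods, using 10,000 structurally diverse compounds as inactive compounds.
| Dataset | ROC_AUC Max-Sim mean | ROC_AUC Max-Sim SD | ROC_AUC AS mean | ROC_AUC AS SD | ROC_AUC % difference | BEDROC Max-Sim mean | BEDROC Max-Sim SD | BEDROC AS mean | BEDROC AS SD | BEDROC % difference |
|---|---|---|---|---|---|---|---|---|---|---|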
| AHR | 0.730 | 0.010 | 0.758 | 0.009 | 3.8 | 0.337 | 0.023 | 0.407 | 0.017 | 20.8 |
| AR | 0.754 | 0.012 | 0.768 | 0.011 | 1.9 | 0.436 | 0.024 | 0.596 | 0.017 | 36.8 |
| ARE | 0.535 | 0.007 | 0.549 | 0.009 | 2.6 | 0.224 | 0.021 | 0.258 | 0.024 | 15.2 |
| AR-lbd | 0.790 | 0.018 | 0.807 | 0.021 | 2.2 | 0.550 | 0.020 | 0.614 | 0.030 | 11.6 |
| Aromatase | 0.621 | 0.024 | 0.663 | 0.023 | 6.7 | 0.202 | 0.040 | 0.289 | 0.036 | 43.2 |
| ATAD5 | 0.653 | 0.024 | 0.666 | 0.025 | 2.0 | 0.268 | 0.021 | 0.281 | 0.023 | 4.9 |
| ER | 0.600 | 0.009 | 0.567 | 0.014 | −5.5 | 0.197 | 0.010 | 0.136 | 0.021 | −30.8 |
| ER-lbd | 0.683 | 0.018 | 0.712 | 0.017 | 4.2 | 0.366 | 0.030 | 0.461 | 0.019 | 25.8 |
| HSE | 0.553 | 0.013 | 0.576 | 0.016 | 4.2 | 0.118 | 0.018 | 0.147 | 0.026 | 24.4 |
| MMP | 0.651 | 0.016 | 0.689 | 0.012 | 5.9 | 0.336 | 0.029 | 0.454 | 0.019 | 35.0 |
| P53 | 0.573 | 0.018 | 0.594 | 0.020 | 3.6 | 0.203 | 0.026 | 0.202 | 0.020 | −0.2 |
| PPARg | 0.689 | 0.023 | 0.689 | 0.022 | 0.0 | 0.278 | 0.024 | 0.245 | 0.043 | −12.0 |
Percent difference between mean ROC_AUC values for the Max-Sim and AS methods.
Percent difference between mean BEDROC values for the Max-Sim and AS methods.
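Mean and standard deviation of ROC_AUC and BEDROC values derived from a fingerprint-based similarity search using the rank by maximum similarity (Max-Sim) and rank by aggregated score (AS) methods, using 10,000 structurally diverse compounds as inactive compounds.
| Dataset | ROC_AUC Max-Sim mean | ROC_AUC Max-Sim SD | ROC_AUC AS mean | ROC_AUC AS SD | ROC_AUC % difference | BEDROC Max-Sim mean | BEDROC Max-Sim SD | BEDROC AS mean | BEDROC AS SD | BEDROC % difference |
|---|---|---|---|---|---|---|---|---|---|---|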
| 4-MU | 0.513 | 0.007 | 0.525 | 0.016 | 2.3 | 0.143 | 0.012 | 0.175 | 0.014 | 22.1 |
| ALDH1A1 | 0.510 | 0.004 | 0.519 | 0.004 | 1.8 | 0.142 | 0.004 | 0.144 | 0.003 | 1.0 |
| BRCA1 | 0.643 | 0.008 | 0.657 | 0.008 | 2.3 | 0.143 | 0.008 | 0.146 | 0.009 | 2.1 |
| DNApb | 0.527 | 0.010 | 0.588 | 0.009 | 11.7 | 0.111 | 0.007 | 0.173 | 0.007 | 55.5 |
| ERK | 0.624 | 0.011 | 0.672 | 0.008 | 7.6 | 0.247 | 0.014 | 0.299 | 0.011 | 21.3 |
| GCN5L2 | 0.542 | 0.015 | 0.542 | 0.017 | −0.1 | 0.132 | 0.009 | 0.133 | 0.012 | 0.9 |
| hERG | 0.725 | 0.011 | 0.794 | 0.006 | 9.6 | 0.411 | 0.032 | 0.522 | 0.024 | 27.2 |
| Lucif | 0.735 | 0.009 | 0.782 | 0.006 | 6.3 | 0.285 | 0.015 | 0.335 | 0.010 | 17.5 |
| MiRNAs | 0.613 | 0.006 | 0.635 | 0.006 | 3.6 | 0.143 | 0.006 | 0.153 | 0.007 | 7.0 |
| Mitoch | 0.512 | 0.008 | 0.513 | 0.007 | 0.1 | 0.082 | 0.008 | 0.097 | 0.008 | 18.4 |
| NPC1 | 0.681 | 0.005 | 0.705 | 0.007 | 3.6 | 0.222 | 0.005 | 0.242 | 0.008 | 9.0 |
| PR901 | 0.896 | 0.019 | 0.901 | 0.020 | 0.5 | 0.718 | 0.030 | 0.757 | 0.037 | 5.5 |
Percent difference between mean ROC_AUC values for the Max-Sim and AS methods.
Percent difference between mean BEDROC values for the Max-Sim and AS methods.
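Summary of the performance of similarity search methods on 40 DUD and 102 DUDE datasets.
| Metric | Mean value (Max-Sim) | Mean % difference vs. Max-Sim | Datasets with value ≥ Max-Sim | Datasets with value < Max-Sim | Datasets at least 1% higher than Max-Sim | Datasets with Max-Sim more than 1% higher |
|---|---|---|---|---|---|---|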
| ROC_AUC | 0.91 | 1.0 | 30 | 10 | 13 | 0 |
| BEDROC | 0.79 | 1.7 | 30 | 10 | 16 | 1 |
| ROC_AUC | 0.90 | 1.2 | 28 | 12 | 15 | 3 |
| BEDROC | 0.76 | 1.1 | 25 | 15 | 16 | 10 |
| ROC_AUC | 0.96 | 0.5 | 90 | 12 | 18 | 0 |
| BEDROC | 0.90 | 0.7 | 87 | 15 | 28 | 2 |
| ROC_AUC | 0.96 | 0.2 | 55 | 47 | 13 | 3 |
| BEDROC | 0.90 | 0.2 | 58 | 44 | 27 | 20 |
DUD: Directory of Useful Decoys.
DUDE: Database of Useful Decoys: Enhanced.
Mean ROC_AUC or BEDROC value calculated from the Max-Sim method over 40 DUD or 102 DUDE datasets.
Mean percentage difference between ROC_AUC or BEDROC values derived from the maxZ or AS methods and the Max-Sim method.
Number of datasets for which ROC_AUC or BEDROC values calculated from the maxZ or AS methods were higher than or equal to the corresponding values calculated from the Max-Sim method, i.e., the number of datasets on which the maxZ or AS method performed comparably to or better than the Max-Sim method.
Number of datasets for which ROC_AUC or BEDROC values calculated from the maxZ or AS methods were lower than the corresponding values calculated from the Max-Sim method, i.e., the number of datasets on which the maxZ or AS method performed worse than the Max-Sim method.
Number of datasets for which ROC_AUC or BEDROC values calculated from the maxZ or AS methods were at least 1% higher than the corresponding values calculated from the Max-Sim method.
Number of datasets for which ROC_AUC or BEDROC values calculated from the Max-Sim method were more than 1% higher than the corresponding values calculated from the maxZ or AS methods.
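The summary statistics defined in the notes above can be tallied from per-dataset metric values as in the following Python sketch. It assumes that the percent difference is taken relative to the Max-Sim value, and that the "Max-Sim more than 1% higher" count is taken relative to the competing method's value; the notes do not state the reference value explicitly. The function name and the four example values are hypothetical and are not results from the study.

```python
def summarize_vs_baseline(baseline, method):
    """Summarize per-dataset metric values of a method against a baseline.

    baseline and method are equal-length lists of ROC_AUC or BEDROC values,
    one pair per dataset. The choice of reference value for each percentage
    is an assumption (see the comments below).
    """
    pairs = list(zip(baseline, method))
    # Percent gain of the method, relative to the baseline value.
    gain = [100.0 * (m - b) / b for b, m in pairs]
    # Percent gain of the baseline, relative to the method value.
    loss = [100.0 * (b - m) / m for b, m in pairs]
    return {
        "mean_baseline": sum(baseline) / len(baseline),
        "mean_pct_diff": sum(gain) / len(gain),
        "n_ge": sum(1 for b, m in pairs if m >= b),
        "n_lt": sum(1 for b, m in pairs if m < b),
        "n_method_1pct_higher": sum(1 for g in gain if g >= 1.0),
        "n_baseline_1pct_higher": sum(1 for x in loss if x > 1.0),
    }


# Hypothetical metric values for four datasets (not taken from the study).
max_sim_values = [0.80, 0.90, 0.70, 0.60]
other_values = [0.82, 0.90, 0.69, 0.50]
summary = summarize_vs_baseline(max_sim_values, other_values)
```

Reporting both the simple win/loss counts and the counts beyond a 1% margin separates datasets where the two methods are effectively tied from those where one method is clearly ahead.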