Fredrik Svensson, Avid M Afzal, Ulf Norinder, Andreas Bender.
Abstract
Iterative screening has emerged as a promising approach to increase the efficiency of screening campaigns compared to traditional high-throughput approaches. By learning from a subset of the compound library, predictive models can infer which compounds to screen next, resulting in more efficient screening. One way to evaluate screening is to consider the cost of screening compared to the gain associated with finding an active compound. In this work, we introduce a conformal predictor coupled with a gain-cost function with the aim to maximize gain in iterative screening. Using this setup, we were able to show that, by evaluating the predictions on the training data, very accurate predictions can be made about which settings will produce the highest gain on the test data. We evaluated the approach on 12 bioactivity datasets from PubChem, training the models on 20% of the data. Depending on the settings of the gain-cost function, the settings generating the maximum gain were accurately identified in 8-10 out of the 12 datasets. Broadly, our approach can predict which strategy generates the highest gain based on the results of the gain-cost evaluation: screen the compounds predicted to be active, screen all the remaining data, or screen no additional compounds. When the algorithm indicates that the predicted active compounds should be screened, our approach also indicates what confidence level to apply in order to maximize gain. Hence, our approach facilitates decision-making and allocation of resources where they deliver the most value by indicating in advance the likely outcome of a screening campaign.
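The gain-cost trade-off described in the abstract can be sketched as a simple net-gain function: the value of the actives found minus the cost of screening the selected compounds. This is a minimal illustration, not the paper's exact implementation; the function name and the example numbers (per-hit gain, costs) are assumptions.

```python
def screening_gain(n_hits, n_screened, gain_per_hit, cost_per_compound):
    """Net gain of a screening campaign: value of actives found
    minus the cost of screening the selected compounds."""
    return n_hits * gain_per_hit - n_screened * cost_per_compound


# Hypothetical example: screening 10,000 compounds at a cost of 10
# units each, finding 500 actives worth 1,000 units apiece.
net = screening_gain(n_hits=500, n_screened=10_000,
                     gain_per_hit=1_000, cost_per_compound=10)
```

Under this framing, a strategy is only worthwhile when the expected hit value outweighs the screening cost, which is why the optimal choice depends on both the hit rate and the cost setting.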
Keywords: Conformal prediction; Gain-cost function; HTS; PubChem datasets
Year: 2018 PMID: 29468427 PMCID: PMC5821614 DOI: 10.1186/s13321-018-0260-4
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
The datasets employed in this study
| AID | Description | Active | Inactive | % Active |
|---|---|---|---|---|
| 411 | qHTS Assay for Inhibitors of Firefly Luciferase | 1577 | 70,097 | 2.2 |
| 868 | Screen for Chemicals that Inhibit the RAM Network | 3545 | 191,037 | 1.8 |
| 1030 | qHTS Assay for Inhibitors of Aldehyde Dehydrogenase 1 (ALDH1A1) | 16,117 | 148,322 | 7.8 |
| 1460 | qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding | 5825 | 221,867 | 2.6 |
| 1721 | qHTS Assay for Inhibitors of Leishmania Mexicana Pyruvate Kinase (LmPK) | 1089 | 290,104 | 0.4 |
| 2314 | Cycloheximide Counterscreen for Small Molecule Inhibitors of Shiga Toxin | 37,055 | 259,401 | 12.5 |
| 2326 | qHTS Assay for Inhibitors of Influenza NS1 Protein Function | 1073 | 260,701 | 0.4 |
| 2451 | qHTS Assay for Inhibitors of Fructose-1,6-bisphosphate Aldolase from Giardia Lamblia | 2061 | 276,158 | 0.7 |
| 2551 | qHTS for inhibitors of ROR gamma transcriptional activity | 16,824 | 256,777 | 6.1 |
| 485290 | qHTS Assay for Inhibitors of Tyrosyl-DNA Phosphodiesterase (TDP1) | 986 | 345,663 | 0.3 |
| 485314 | qHTS Assay for Inhibitors of DNA Polymerase Beta | 4522 | 315,791 | 1.4 |
| 504444 | Nrf2 qHTS screen for inhibitors | 7472 | 285,618 | 2.5 |
Number of compounds in training and test data for all the datasets after data processing
| AID | Train active | Train inactive | Test active | Test inactive |
|---|---|---|---|---|
| 411 | 340 | 13,761 | 1215 | 55,187 |
| 868 | 326 | 19,129 | 3219 | 171,705 |
| 1030 | 3240 | 29,090 | 12,674 | 116,642 |
| 1460 | 132 | 4637 | 1057 | 41,197 |
| 1721 | 219 | 57,905 | 868 | 231,624 |
| 2314 | 3730 | 25,769 | 33,225 | 232,103 |
| 2326 | 190 | 51,988 | 877 | 207,835 |
| 2451 | 422 | 54,560 | 1594 | 218,333 |
| 2551 | 1681 | 25,443 | 14,951 | 227,744 |
| 485290 | 192 | 67,593 | 761 | 270,377 |
| 485314 | 857 | 62,561 | 3634 | 250,038 |
| 504444 | 1524 | 56,628 | 5882 | 226,723 |
Fig. 1 Schematic representation of the validation procedure used in this study
Fig. 2 Illustration of how conformal prediction classes are assigned
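The class assignment illustrated in Fig. 2 follows the standard rule for binary conformal prediction: a label enters the prediction set when its p-value exceeds the significance level (1 minus the confidence level), giving four possible outcomes. A minimal sketch, with hypothetical function name and example p-values:

```python
def conformal_class(p_active, p_inactive, confidence):
    """Assign a conformal prediction class at the given confidence level.
    A label is included in the prediction set when its p-value exceeds
    the significance level (1 - confidence)."""
    eps = 1.0 - confidence
    active = p_active > eps
    inactive = p_inactive > eps
    if active and inactive:
        return "both"
    if active:
        return "active"
    if inactive:
        return "inactive"
    return "empty"


conformal_class(0.30, 0.05, confidence=0.80)  # -> 'active'
```

Lowering the confidence level shrinks the significance threshold's complement, so prediction sets become smaller: compounds migrate from "both" toward single-label or "empty" predictions.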
Average validity of the physicochemical and fingerprint based models

| Confidence level | 90% | 80% | 70% | 60% |
|---|---|---|---|---|
| *Physicochemical descriptor models* | | | | |
| Validity train active | 0.928 | 0.833 | 0.728 | 0.631 |
| Validity train inactive | 0.910 | 0.813 | 0.715 | 0.614 |
| Validity test active | 0.922 | 0.818 | 0.718 | 0.615 |
| Validity test inactive | 0.907 | 0.811 | 0.714 | 0.615 |
| *Fingerprint based models* | | | | |
| Validity train active | 0.976 | 0.896 | 0.771 | 0.627 |
| Validity train inactive | 0.949 | 0.888 | 0.809 | 0.694 |
| Validity test active | 0.972 | 0.895 | 0.766 | 0.610 |
| Validity test inactive | 0.943 | 0.884 | 0.810 | 0.714 |
Validity and efficiency for active and inactive compounds at the 80% confidence level for the derived conformal predictors based on physicochemical descriptors
| AID | Validity active | Efficiency active | Validity inactive | Efficiency inactive |
|---|---|---|---|---|
| 411 train | 0.856 | 0.809 | 0.815 | 0.771 |
| 411 test | 0.873 | 0.847 | 0.811 | 0.794 |
| 868 train | 0.828 | 0.798 | 0.813 | 0.835 |
| 868 test | 0.825 | 0.844 | 0.805 | 0.862 |
| 1030 train | 0.823 | 0.654 | 0.819 | 0.636 |
| 1030 test | 0.832 | 0.677 | 0.807 | 0.653 |
| 1460 train | 0.864 | 0.864 | 0.816 | 0.880 |
| 1460 test | 0.748 | 0.944 | 0.805 | 0.957 |
| 1721 train | 0.868 | 0.918 | 0.842 | 0.899 |
| 1721 test | 0.869 | 0.933 | 0.835 | 0.907 |
| 2314 train | 0.813 | 0.810 | 0.807 | 0.808 |
| 2314 test | 0.801 | 0.833 | 0.803 | 0.819 |
| 2326 train | 1.000 | 0.395 | 0.856 | 0.144 |
| 2326 test | 1.000 | 0.511 | 0.849 | 0.151 |
| 2451 train | 0.884 | 0.746 | 0.836 | 0.660 |
| 2451 test | 0.859 | 0.778 | 0.828 | 0.707 |
| 2551 train | 0.819 | 0.916 | 0.809 | 0.906 |
| 2551 test | 0.812 | 0.944 | 0.803 | 0.934 |
| 485290 train | 1.000 | 0.510 | 0.860 | 0.150 |
| 485290 test | 1.000 | 0.545 | 0.863 | 0.137 |
| 485314 train | 0.846 | 0.762 | 0.824 | 0.726 |
| 485314 test | 0.856 | 0.799 | 0.818 | 0.743 |
| 504444 train | 0.833 | 0.749 | 0.813 | 0.755 |
| 504444 test | 0.818 | 0.767 | 0.811 | 0.771 |
"Train" denotes results from the internal validation; "test" denotes results when the models are applied to the external test set
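The validity and efficiency figures in the table above can be computed per class: validity is the fraction of compounds of a given true class whose prediction set contains that class, and efficiency is the fraction of those compounds that received a single-label prediction set. A sketch, assuming prediction sets are represented as Python sets; the function name is illustrative:

```python
def validity_efficiency(prediction_sets, true_labels, label):
    """Per-class validity and efficiency of a conformal predictor.
    Validity: fraction of compounds with true class `label` whose
    prediction set contains that label.  Efficiency: fraction of those
    compounds that received a single-label prediction set."""
    sets = [s for s, t in zip(prediction_sets, true_labels) if t == label]
    validity = sum(label in s for s in sets) / len(sets)
    efficiency = sum(len(s) == 1 for s in sets) / len(sets)
    return validity, efficiency


# Hypothetical predictions for four compounds.
preds = [{"active"}, {"active", "inactive"}, set(), {"inactive"}]
labels = ["active", "active", "active", "inactive"]
v, e = validity_efficiency(preds, labels, "active")
```

Note how "both" predictions count toward validity but against efficiency, which explains table rows such as AID 2326, where validity for actives is 1.000 but efficiency is low.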
Fig. 3 Evaluation of the gain-cost function for three examples showing different trends (using the physicochemical descriptor based models). The dashed line represents the test data and the solid line the evaluation of the training data. Trends observed in the training data generally predict the trend on the remaining test data very well
Average percent loss in gain where training data did not correctly predict maximum gain for the test set

| Cost | Partially screened datasets^a | FP models: datasets^b | FP models: % loss | PC models: datasets^b | PC models: % loss |
|---|---|---|---|---|---|
| 6 | 9 | 6 | 5.7^c | 4 | 2.1 |
| 10 | 10 | 3 | 1.0 | 3 | 1.8 |
| 14 | 10 | 3 | 1.6 | 2 | 0.4 |

FP fingerprint based models, PC physicochemical descriptor based models
^a Datasets where the validation did not indicate that the entire set should be screened for maximum gain
^b Datasets where the optimum training set validation setting did not correspond to the maximum test set gain
^c Includes dataset 2326, which fails with a 23.9% loss; excluding this result, the average is 2.1%
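The decision logic described in the abstract (screen the compounds predicted active at the best confidence level, screen all remaining data, or screen nothing) can be sketched as picking the highest-gain option from the training-set evaluation. The function name and data layout below are assumptions for illustration:

```python
def choose_strategy(gain_by_confidence, gain_screen_all):
    """Pick the screening strategy with the highest gain on the training
    (internal validation) data.  `gain_by_confidence` maps each
    confidence level to the gain from screening only the compounds
    predicted active at that level; `gain_screen_all` is the gain from
    screening everything.  Screening nothing has zero gain."""
    best_conf, best_gain = max(gain_by_confidence.items(),
                               key=lambda kv: kv[1])
    options = {
        ("screen predicted actives", best_conf): best_gain,
        ("screen all", None): gain_screen_all,
        ("screen none", None): 0.0,
    }
    return max(options.items(), key=lambda kv: kv[1])[0]
```

For example, if screening the predicted actives at 80% confidence yields the largest training-set gain, the returned strategy also carries that confidence level, matching the paper's observation that the approach indicates which confidence level to apply.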
Number of times the highest gain (training and test set) was obtained from fingerprint (FP) and physicochemical (PC) descriptors based models respectively
| Cost | Max gain FP | Max gain PC | Ties^a |
|---|---|---|---|
| 6 | 6 | 3 | 3 |
| 10 | 9 | 1 | 2 |
| 14 | 9 | 1 | 2 |
^a Ties occur when the validation indicates that the entire library should be screened