Abstract
BACKGROUND: There are three main problems associated with the virtual screening of bioassay data. The first is access to freely available curated data, the second is the number of false positives that occur in the physical primary screening process, and the third is that the data are highly imbalanced, with a low ratio of Active to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) to a variety of bioassay datasets.
Year: 2009 PMID: 20150999 PMCID: PMC2820499 DOI: 10.1186/1758-2946-1-21
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Summary of Primary Screen false positives
| Primary (PS) | Confirmatory (CS) | PS Actives | CS Tested | CS Actives | False Positive % |
|---|---|---|---|---|---|
| AID604 | AID644 | 212 | 206 | 67 | 65.57% |
| AID1284 | AID746 | 366 | 362 | 57 | 83.33% |
| AID439 | AID373 | 62 | 69 | 13 | 90.32% |
| AID721 | AID687 | 94 | 94 | 21 | 77.66% |
| AID561 | AID611 | 278 | 273 | 195 | 28.06% |
| AID525 | AID600 | 359 | 359 | 213 | 40.67% |
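The False Positive % column appears consistent with counting every compound that was tested in the confirmatory screen but not confirmed as Active, relative to the number of primary screen Actives. The following is a minimal sketch under that assumption; the formula is inferred from the rows of the table rather than stated in the source, and the function name `false_positive_pct` is hypothetical.

```python
def false_positive_pct(ps_actives: int, cs_tested: int, cs_actives: int) -> float:
    """Percentage of primary screen Actives not confirmed in the confirmatory screen.

    Assumed formula (inferred from the table, not stated in the source):
    (CS Tested - CS Actives) / PS Actives * 100.
    """
    return round((cs_tested - cs_actives) / ps_actives * 100, 2)


# Rows copied from the table: (primary AID, PS Actives, CS Tested, CS Actives)
rows = [
    ("AID604", 212, 206, 67),
    ("AID1284", 366, 362, 57),
    ("AID439", 62, 69, 13),
    ("AID721", 94, 94, 21),
    ("AID561", 278, 273, 195),
    ("AID525", 359, 359, 213),
]

for aid, ps_actives, cs_tested, cs_actives in rows:
    print(aid, false_positive_pct(ps_actives, cs_tested, cs_actives))
```

Under this assumption the computed values match the percentages listed for all six assay pairs.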
Summary of Bioassay datasets used in the predictive models
| Assay | No of Attributes | Screening Type | Compounds | Minority Class % |
|---|---|---|---|---|
| AID362 | 144 | Primary | 4279 | 1.4% |
| AID604 | 154 | Primary | 59788 | 0.35% |
| AID456 | 153 | Primary | 9982 | 0.27% |
| AID688 | 154 | Primary | 27198 | 0.91% |
| AID373 | 154 | Primary | 59788 | 0.1% |
| AID746 | 154 | Primary | 59788 | 0.61% |
| AID687 | 153 | Primary | 33067 | 0.28% |
| AID746&AID1284 | 154 | Primary and Confirmatory | 59784 | 0.1% |
| AID604&AID644 | 154 | Primary and Confirmatory | 59782 | 0.11% |
| AID373&AID439 | 154 | Primary and Confirmatory | 59795 | 0.02% |
| AID687&AID721 | 153 | Primary and Confirmatory | 33046 | 0.06% |
| AID1608 | 154 | Confirmatory | 1033 | 6.58% |
| AID644 | 100 | Confirmatory | 206 | 32.52% |
| AID1284 | 103 | Confirmatory | 362 | 15.75% |
| AID439 | 81 | Confirmatory | 69 | 18.84% |
| AID721 | 87 | Confirmatory | 94 | 22.34% |
| AID1608 | 914 | Confirmatory | 1033 | 6.58% |
| AID644 | 914 | Confirmatory | 206 | 32.52% |
| AID1284 | 914 | Confirmatory | 362 | 15.75% |
| AID439 | 914 | Confirmatory | 69 | 18.84% |
| AID721 | 914 | Confirmatory | 94 | 22.34% |
A typical Cost Matrix which shows the misclassification cost for Positives and Negatives
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | 0 (TP) | 1 (FP) |
| Predicted Negative | 5 (FN) | 0 (TN) |
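One common way to use such a cost matrix is the minimum expected cost rule: given a predicted probability that a compound is Active, choose the label whose expected misclassification cost is lower. The sketch below uses the costs from the matrix above (1 for a False Positive, 5 for a False Negative). The function names are hypothetical, and this illustrates the general rule rather than how any particular Weka classifier applies the costs.

```python
# Costs taken from the cost matrix above; correct predictions (TP, TN) cost 0.
COST_FP = 1.0  # predicting Active for an Inactive compound
COST_FN = 5.0  # predicting Inactive for an Active compound


def total_cost(fp: int, fn: int) -> float:
    """Total misclassification cost for given False Positive and False Negative counts."""
    return fp * COST_FP + fn * COST_FN


def min_cost_label(p_active: float) -> str:
    """Return the label with the lower expected cost, given P(Active)."""
    cost_if_predict_active = (1.0 - p_active) * COST_FP  # risk of a False Positive
    cost_if_predict_inactive = p_active * COST_FN        # risk of a False Negative
    if cost_if_predict_active < cost_if_predict_inactive:
        return "Active"
    return "Inactive"


print(total_cost(fp=10, fn=2))  # 10 * 1 + 2 * 5
print(min_cost_label(0.30))
print(min_cost_label(0.10))
```

With these costs the decision threshold on P(Active) moves from 0.5 down to 1 / (1 + 5), about 0.167, so a compound is labelled Active on much weaker evidence than an unweighted classifier would require.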
Misclassification Costs per primary screen dataset and mixed primary/confirmatory datasets
| Dataset | Naive Bayes | SMO | Random Forest | J48 |
|---|---|---|---|---|
| AID362 | 40 | 150 | 3000 | 285 |
| AID604 | 40 | 250 | Out of memory | 650 |
| AID456 | 18 | 200 | 100000 | 1000 |
| AID688 | 34 | 78 | Out of memory | 220 |
| AID373 | 20 | 2000 | Out of memory | 3000 |
| AID746 | 25 | 100 | Out of memory | 450 |
| AID687 | 50 | 250 | Out of memory | 680 |
| AID746&AID1284 | 100 | 1000 | Out of memory | 1900 |
| AID604&AID644 | 70 | 750 | Out of memory | 1500 |
| AID373&AID439 | 70 | 9000 | Out of memory | 9500 |
| AID687&AID721 | 700 | 6702 | Out of memory | 1900 |
Misclassification Costs for False Negatives per confirmatory dataset
| Dataset | Naive Bayes | SMO | Random Forest | J48 |
|---|---|---|---|---|
| AID1608a | 2 | 5 | 75 | 25 |
| AID644a | None* | None | None | None* |
| AID1284a | None* | 2.7 | 8 | 2 |
| AID439a | None* | None | None | None |
| AID721a | None* | None | None | None |
| AID1608b | None | 200 | 75 | 30 |
| AID644b | None* | None | 1.5 | 2 |
| AID1284b | None* | 6 | 8 | 2 |
| AID439b | None | None | 3 | None* |
| AID721b | None | None | None | None |
Figure 1. Primary Screen datasets: True Positive rate with under or approximately a 20% False Positive rate. The True Positive rate achieved by each type of classifier for the Primary Screen datasets. A maximum of 20% False Positives was allowed.
The True Positive and False Positive rates for the confirmatory bioassay datasets
| Dataset | Naive Bayes TP% | Naive Bayes FP% | SMO TP% | SMO FP% | Random Forest TP% | Random Forest FP% | J48 TP% | J48 FP% |
|---|---|---|---|---|---|---|---|---|
| AID1608a (154) | 23.08 | 19.17 | 30.77 | 8.81 | 30.77 | 8.29 | 15.78 | 20.21 |
| AID644a (100) | 38.46 | 39.29 | 23.08 | 17.86 | 23.08 | 7.14 | 38.46 | 28.57 |
| AID1284a (103) | 27.27 | 26.23 | 36.36 | 13.11 | 45.45 | 18.03 | 54.55 | 13.11 |
| AID439a (81) | 100.00 | 27.27 | 50.00 | 9.09 | 50.00 | 18.18 | 50.00 | 18.18 |
| AID721a (87) | 0.00 | 28.57 | 0.00 | 14.29 | 0.00 | 21.43 | 0.00 | 14.29 |
| AID1608b (914) | 30.77 | 12.95 | 38.46 | 13.47 | 30.77 | 12.95 | 38.46 | 18.13 |
| AID644b (914) | 30.77 | 28.57 | 53.85 | 14.29 | 38.46 | 17.86 | 30.77 | 14.29 |
| AID1284b (914) | 36.36 | 24.59 | 36.36 | 13.11 | 54.55 | 18.03 | 45.45 | 16.39 |
| AID439b (914) | 50.00 | 18.18 | 50.00 | 9.09 | 50.00 | 18.18 | 0.00 | 27.27 |
| AID721b (914) | 0.00 | 14.29 | 0.00 | 21.43 | 0.00 | 7.14 | 0.00 | 0.00 |
Figure 2. Mixed datasets: True Positive rate with under or approximately a 20% False Positive rate. The True Positive rate achieved by each type of classifier for the Mixed Primary Screen/Confirmatory Screen datasets. A maximum of 20% False Positives was allowed.
Best classification models for the bioassays with mixed, primary and confirmatory data
| Assay | Best Model | TP% | FP% | Accuracy % |
|---|---|---|---|---|
| AID604 | CSC SMO | 64.29 | 20.59 | 79.36 |
| AID644b | SMO | 53.85 | 14.29 | 75.61 |
| AID373 | MetaCost J48 | 75.00 | 14.50 | 85.49 |
| AID439ab | SMO | 50.00 | 9.09 | 84.62 |
| AID746 | MetaCost J48 | 63.01 | 20.30 | 79.60 |
| AID1284b | CSC Random Forest | 54.55 | 18.03 | 77.78 |
| AID687 | CSC Naive Bayes | 44.44 | 18.97 | 80.93 |
| AID721b | J48 | 0.00 | 0.00 | 77.78 |
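The TP%, FP% and Accuracy % columns follow the standard confusion-matrix definitions. The sketch below shows how the three figures relate. The function name `rates` is hypothetical, and the counts in the example are illustrative values chosen for this sketch, not test-set sizes reported in the source.

```python
def rates(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float, float]:
    """Return (TP%, FP%, Accuracy %) from confusion-matrix counts."""
    tp_rate = 100.0 * tp / (tp + fn)                     # Actives correctly predicted
    fp_rate = 100.0 * fp / (fp + tn)                     # Inactives wrongly predicted Active
    accuracy = 100.0 * (tp + tn) / (tp + fp + fn + tn)   # all correct predictions
    return round(tp_rate, 2), round(fp_rate, 2), round(accuracy, 2)


# Illustrative counts only (an assumption, not data from the source):
# 2 Active and 11 Inactive test compounds, with one error in each class.
print(rates(tp=1, fp=1, fn=1, tn=10))
```

When a model predicts every compound as Inactive (TP = 0 and FP = 0), accuracy simply equals the Inactive share of the test set, which is why the TP and FP rates are reported alongside accuracy for imbalanced data.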