Rafał Kurczab, Andrzej J Bojarski.
Abstract
The machine learning-based virtual screening of molecular databases is a commonly used approach to identify hits. However, many aspects associated with training predictive models can influence the final performance and, consequently, the number of hits found. Thus, we performed a systematic study of the simultaneous influence of the proportion of negatives to positives in the testing set, the size of screening databases and the type of molecular representations on the effectiveness of classification. The results obtained for eight protein targets, five machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest), two types of molecular fingerprints (MACCS and CDK FP) and eight screening databases with different numbers of molecules confirmed our previous findings that increases in the ratio of negative to positive training instances greatly influenced most of the investigated parameters of the ML methods in simulated virtual screening experiments. However, the performance of screening was shown to also be highly dependent on the molecular library dimension. Generally, with the increasing size of the screened database, the optimal training ratio also increased, and this ratio can be rationalized using the proposed cost-effectiveness threshold approach. To increase the performance of machine learning-based virtual screening, the training set should be constructed in a way that considers the size of the screening database.
Year: 2017 PMID: 28384344 PMCID: PMC5383296 DOI: 10.1371/journal.pone.0175410
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Composition of the training and test sets used.
| Target | ChEMBL class | ChEMBL target ID | Number of actives (training set) | Number of actives (test set) |
|---|---|---|---|---|
| | membrane receptor | CHEMBL214 | 198 | 903 |
| | enzyme/protease | CHEMBL243 | 203 | 932 |
| | transporter | CHEMBL228 | 390 | 1822 |
| | nuclear receptor | CHEMBL206 | 133 | 614 |
| | enzyme/hydrolase | CHEMBL220 | 162 | 743 |
| | enzyme/phosphodiesterase | CHEMBL1827 | 152 | 695 |
| | enzyme/kinase | CHEMBL301 | 236 | 1084 |
| | membrane receptor | CHEMBL1800 | 200 | 914 |
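The abstract varies the ratio of negative to positive training instances. The sketch below shows one way such a training set could be assembled, assuming that all training actives are kept and that inactives are sampled without replacement from a larger pool. The helper names (`inactives_needed`, `build_training_set`) and the sampling scheme are illustrative assumptions, not the authors' procedure; only the active counts come from the table above.

```python
import random

# Training-set active counts copied from the table above, keyed by ChEMBL target ID.
TRAINING_ACTIVES = {
    "CHEMBL214": 198, "CHEMBL243": 203, "CHEMBL228": 390, "CHEMBL206": 133,
    "CHEMBL220": 162, "CHEMBL1827": 152, "CHEMBL301": 236, "CHEMBL1800": 200,
}


def inactives_needed(n_actives, ratio):
    """Number of inactives for a given inactive-to-active ratio (rounded)."""
    return round(n_actives * ratio)


def build_training_set(actives, inactive_pool, ratio, seed=0):
    """Hypothetical helper: keep every active, sample inactives without replacement.

    Returns a list of (molecule, label) pairs with label 1 for actives
    and 0 for inactives.
    """
    n = inactives_needed(len(actives), ratio)
    if n > len(inactive_pool):
        raise ValueError("inactive pool too small for the requested ratio")
    rng = random.Random(seed)  # fixed seed so a trial can be reproduced
    sampled = rng.sample(inactive_pool, n)
    return [(m, 1) for m in actives] + [(m, 0) for m in sampled]


if __name__ == "__main__":
    for target, n_act in TRAINING_ACTIVES.items():
        print(target, n_act, inactives_needed(n_act, 40))
```

Under these assumptions, a ratio of 40 for CHEMBL214 (198 training actives) would require 7920 sampled inactives, which shows how quickly the training set grows at high ratios.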
Machine learning algorithms used and a short description of their training parameters.
| Classifier | Classification scheme | Settings |
|---|---|---|
| SMO [a] | functions | The complexity parameter was set at 1, the epsilon for a round-off error was 1.0E-12, and the option of normalizing training data was chosen. The normalized polynomial kernel was used. |
| Naïve Bayes | bayes | – |
| IBk [b] | lazy | The nearest neighbor search algorithm using the Euclidean distance function and 1 neighbor. |
| J48 [c] | trees | C4.5 pruning |
| Random Forest | trees | Trees with unlimited depth, seed number: 1. Number of generated trees: 10. |

[a] The SVM algorithm implemented in WEKA.
[b] The k-NN algorithm implemented in WEKA.
[c] The decision tree algorithm implemented in WEKA.
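The k-NN settings listed above (Euclidean distance, 1 neighbor) can be illustrated with a minimal pure-Python sketch. This is not WEKA's implementation; the function names and the toy 4-bit fingerprints are assumptions made only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length fingerprint bit vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train, query):
    """Return the label of the training fingerprint closest to `query`.

    `train` is a list of (fingerprint, label) pairs. On ties, the first
    closest item in `train` wins (behavior of Python's min()).
    """
    nearest = min(train, key=lambda item: euclidean(item[0], query))
    return nearest[1]


if __name__ == "__main__":
    # Toy fingerprints, invented for this example (1 = active, 0 = inactive).
    train = [((1, 1, 0, 0), 1), ((0, 0, 1, 1), 0)]
    print(predict_1nn(train, (1, 0, 0, 0)))
    print(predict_1nn(train, (0, 0, 1, 0)))
```

With a single neighbor, every additional inactive in the training set can directly flip the prediction for nearby query compounds, which is one plausible reason such a classifier would be sensitive to the training ratio.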
Fig 1. The dependence of machine learning-based virtual screening performance on the negative training set size for two types of fingerprints (panel A: CDK FP; panel B: MACCS FP), averaged over 10 independent trials.
The colored lines denote the evaluated parameter (blue: recall; red: precision; magenta: MCC; green: PR plot).
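Recall, precision and MCC (Matthews correlation coefficient) named in the caption above follow from the confusion matrix of a screening run. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Recall, precision and MCC from confusion-matrix counts.

    Undefined ratios (zero denominators) are reported as 0.0.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical counts: 100 actives and 900 inactives in the screened set.
    print(screening_metrics(tp=90, fp=10, tn=890, fn=10))
```

Because precision depends on the number of false positives, it is the metric most directly affected when the screened library contains many more inactives than actives.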
Fig 2. The dependence of the IN/A training ratio on the cost-effectiveness threshold for different screening library sizes.
Fig 3. The dependence of the optimal IN/A training ratio on the size of the screening library, obtained for several arbitrarily selected cost-effectiveness thresholds.
For comparison, the training ratio obtained for the best MCC is shown (black line).
The optimal IN/A training ratios obtained for a cost-effectiveness threshold equal to 0.03.
| Target | Screening library size | SMO, CDK FP | SMO, MACCS | NB, CDK FP | NB, MACCS | IBk, CDK FP | IBk, MACCS | J48, CDK FP | J48, MACCS | RF, CDK FP | RF, MACCS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 5000 | 2 | 2 | 2 | 60 | 10 | 10 | 10 | 2 | 4 | 4 |
| | 50000 | 7 | 7 | 0.5 | 40 | 60 | 80 | 60 | 15 | 10 | 15 |
| | 400000 | 40 | 40 | 0.5 | 10 | 100 | 100 | 80 | 80 | 40 | 60 |
| | 5000 | 2 | 2 | 0.5 | 4 | 4 | 4 | 7 | 4 | 4 | 4 |
| | 50000 | 4 | 10 | 4 | 10 | 10 | 15 | 40 | 40 | 7 | 15 |
| | 400000 | 10 | 40 | 2 | 2 | 40 | 40 | 80 | 80 | 20 | 60 |
| | 5000 | 1 | 1 | 0.5 | 0.5 | 4 | 1 | 2 | 1 | 1 | 1 |
| | 50000 | 2 | 4 | 0.5 | 10 | 20 | 10 | 10 | 4 | 2 | 7 |
| | 400000 | 7 | 20 | 0.5 | 7 | 30 | 30 | 30 | 20 | 7 | 20 |
| | 5000 | 4 | 4 | 1 | 15 | 7 | 7 | 7 | 7 | 4 | 4 |
| | 50000 | 7 | 15 | 7 | 25 | 30 | 60 | 60 | 90 | 15 | 25 |
| | 400000 | 7 | 90 | 7 | 7 | 60 | 90 | 60 | 90 | 15 | 90 |
| | 5000 | 2 | 2 | 2 | 50 | 10 | 4 | 10 | 10 | 4 | 5 |
| | 50000 | 7 | 10 | 2 | 50 | 50 | 15 | 70 | 70 | 10 | 25 |
| | 400000 | 10 | 50 | 4 | 2 | 70 | 100 | 100 | 70 | 15 | 100 |
| | 5000 | 2 | 2 | 0.5 | 4 | 4 | 10 | 15 | 10 | 4 | 4 |
| | 50000 | 7 | 10 | 0.5 | 50 | 20 | 50 | 100 | 50 | 10 | 20 |
| | 400000 | 10 | 50 | 2 | 10 | 50 | 100 | 80 | 100 | 15 | 100 |
| | 5000 | 2 | 15 | 2 | 7 | 4 | 4 | 7 | 4 | 4 | 4 |
| | 50000 | 4 | 15 | 2 | 4 | 30 | 30 | 80 | 50 | 7 | 15 |
| | 400000 | 7 | 30 | 0.5 | 0.5 | 50 | 50 | 50 | 50 | 10 | 30 |
| | 5000 | 2 | 2 | 2 | 10 | 4 | 4 | 4 | 4 | 4 | 4 |
| | 50000 | 7 | 10 | 2 | 40 | 40 | 60 | 60 | 40 | 10 | 20 |
| | 400000 | 40 | 80 | 0.5 | 7 | 60 | 80 | 80 | 80 | 40 | 80 |
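The trend stated in the abstract, that the optimal training ratio tends to grow with the size of the screened library, can be checked against the table above. The sketch below transcribes the table values and takes the median over the eight targets for one classifier and fingerprint combination. The median summary and the function name are illustrative choices made here, not an analysis reported by the authors.

```python
from statistics import median

# Column order of the table above: classifier, then fingerprint.
COLUMNS = [(clf, fp) for clf in ("SMO", "NB", "IBk", "J48", "RF")
           for fp in ("CDK FP", "MACCS")]

# (screening library size, best IN/A ratios), transcribed row by row.
ROWS = [
    (5000, [2, 2, 2, 60, 10, 10, 10, 2, 4, 4]),
    (50000, [7, 7, 0.5, 40, 60, 80, 60, 15, 10, 15]),
    (400000, [40, 40, 0.5, 10, 100, 100, 80, 80, 40, 60]),
    (5000, [2, 2, 0.5, 4, 4, 4, 7, 4, 4, 4]),
    (50000, [4, 10, 4, 10, 10, 15, 40, 40, 7, 15]),
    (400000, [10, 40, 2, 2, 40, 40, 80, 80, 20, 60]),
    (5000, [1, 1, 0.5, 0.5, 4, 1, 2, 1, 1, 1]),
    (50000, [2, 4, 0.5, 10, 20, 10, 10, 4, 2, 7]),
    (400000, [7, 20, 0.5, 7, 30, 30, 30, 20, 7, 20]),
    (5000, [4, 4, 1, 15, 7, 7, 7, 7, 4, 4]),
    (50000, [7, 15, 7, 25, 30, 60, 60, 90, 15, 25]),
    (400000, [7, 90, 7, 7, 60, 90, 60, 90, 15, 90]),
    (5000, [2, 2, 2, 50, 10, 4, 10, 10, 4, 5]),
    (50000, [7, 10, 2, 50, 50, 15, 70, 70, 10, 25]),
    (400000, [10, 50, 4, 2, 70, 100, 100, 70, 15, 100]),
    (5000, [2, 2, 0.5, 4, 4, 10, 15, 10, 4, 4]),
    (50000, [7, 10, 0.5, 50, 20, 50, 100, 50, 10, 20]),
    (400000, [10, 50, 2, 10, 50, 100, 80, 100, 15, 100]),
    (5000, [2, 15, 2, 7, 4, 4, 7, 4, 4, 4]),
    (50000, [4, 15, 2, 4, 30, 30, 80, 50, 7, 15]),
    (400000, [7, 30, 0.5, 0.5, 50, 50, 50, 50, 10, 30]),
    (5000, [2, 2, 2, 10, 4, 4, 4, 4, 4, 4]),
    (50000, [7, 10, 2, 40, 40, 60, 60, 40, 10, 20]),
    (400000, [40, 80, 0.5, 7, 60, 80, 80, 80, 40, 80]),
]


def median_best_ratio(classifier, fingerprint):
    """Median best IN/A ratio over all targets, per screening library size."""
    col = COLUMNS.index((classifier, fingerprint))
    by_size = {}
    for size, values in ROWS:
        by_size.setdefault(size, []).append(values[col])
    return {size: median(vals) for size, vals in sorted(by_size.items())}


if __name__ == "__main__":
    for clf, fp in COLUMNS:
        print(clf, fp, median_best_ratio(clf, fp))
```

For SMO with CDK FP, the medians are 2, 7 and 10 for libraries of 5000, 50000 and 400000 compounds, consistent with the increase described in the abstract.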