Oliver P Watson, Isidro Cortes-Ciriano, Aimee R Taylor, James A Watson.
Abstract
MOTIVATION: Artificial intelligence, trained via machine learning (e.g. neural nets, random forests) or computational statistical algorithms (e.g. support vector machines, ridge regression), holds much promise for the improvement of small-molecule drug discovery. However, small-molecule structure-activity data are high dimensional with low signal-to-noise ratios, and proper validation of predictive methods is difficult. It is poorly understood which, if any, of the currently available machine learning algorithms will best predict new candidate drugs.
Year: 2019 PMID: 31070704 PMCID: PMC6853675 DOI: 10.1093/bioinformatics/btz293
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Twenty-five publicly available datasets extracted from ChEMBL and analysed in this article
| Target preferred name | Target abbreviation | UniProt ID | ChEMBL ID | # Bioactive molecules |
|---|---|---|---|---|
| Alpha-2a adrenergic receptor | A2a | P08913 | 1867 | 203 |
| Tyrosine-protein kinase ABL | ABL1 | P00519 | 1862 | 773 |
| Acetylcholinesterase | Acetylcholin | P22303 | 220 | 3159 |
| Androgen Receptor | Androgen | P10275 | 1871 | 1290 |
| Serine/threonine-protein kinase Aurora-A | Aurora-A | O14965 | 4722 | 2125 |
| Serine/threonine-protein kinase B-raf | B-raf | P15056 | 5145 | 1730 |
| Cannabinoid CB1 receptor | Cannabinoid | P21554 | 218 | 1116 |
| Carbonic anhydrase II | Carbonic | P00918 | 205 | 603 |
| Caspase-3 | Caspase | P42574 | 2334 | 1606 |
| Thrombin | Coagulation | P00734 | 204 | 1700 |
| Cyclooxygenase-1 | COX-1 | P23219 | 221 | 1343 |
| Cyclooxygenase-2 | COX-2 | P35354 | 230 | 2855 |
| Dihydrofolate reductase | Dihydrofolate | P00374 | 202 | 584 |
| Dopamine D2 receptor | Dopamine | P14416 | 217 | 479 |
| Norepinephrine transporter | Ephrin | P23975 | 222 | 1740 |
| Epidermal growth factor receptor erbB1 | erbB1 | P00533 | 203 | 4868 |
| Estrogen receptor alpha | Estrogen | P03372 | 206 | 1705 |
| Glucocorticoid receptor | Glucocorticoid | P04150 | 2034 | 1447 |
| Glycogen synthase kinase-3 beta | Glycogen | P49841 | 262 | 1757 |
| HERG | HERG | Q12809 | 240 | 5207 |
| Tyrosine-protein kinase JAK2 | JAK2 | O60674 | 2971 | 2655 |
| Tyrosine-protein kinase LCK | LCK | P06239 | 258 | 1352 |
| Monoamine oxidase A | Monoamine | P21397 | 1951 | 1379 |
| Mu opioid receptor | Opioid | P35372 | 233 | 840 |
| Vanilloid receptor | Vanilloid | Q8NER1 | 4794 | 1923 |
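The ChEMBL IDs in the table map onto records that can be pulled programmatically. Below is a minimal sketch, not the authors' extraction pipeline, that fetches bioactivities for one target from the table (ABL1, ChEMBL ID 1862) using the official chembl_webresource_client package; the pChEMBL filter and the selected fields are illustrative choices.

```python
# Hedged sketch: fetch bioactivity records for ABL1 (ChEMBL ID 1862 in the
# table above) from the ChEMBL web services. Not the paper's pipeline; the
# filters below are illustrative.
from chembl_webresource_client.new_client import new_client

activity = new_client.activity
records = activity.filter(
    target_chembl_id="CHEMBL1862",     # "Tyrosine-protein kinase ABL" row
    pchembl_value__isnull=False,       # keep entries with a -log10 activity value
).only(["molecule_chembl_id", "canonical_smiles", "pchembl_value"])

for rec in list(records)[:5]:          # peek at the first few records
    print(rec["molecule_chembl_id"], rec["pchembl_value"])
```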
Fig. 1. Model comparison using the standard bootstrap. Expected model out-of-sample mean squared error is shown for each dataset, ordered from left to right by increasing dataset size. Error bars correspond to ±2 standard errors around the expected loss estimate, computed using the jackknife estimator. For each dataset, the optimal model is the one with the least expected loss; random forests score best on every dataset.
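A compact sketch of the standard-bootstrap comparison behind Fig. 1 follows. It assumes featurized molecules X and activities y, uses the out-of-bag points of each bootstrap replicate as the test set, and reports a simple Monte Carlo standard error rather than the jackknife estimator used in the paper.

```python
# Hedged sketch of a standard-bootstrap out-of-sample MSE estimate.
# X: (n, p) feature matrix, y: (n,) activity vector; both assumed given.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def bootstrap_oos_mse(model, X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    losses = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # bootstrap resample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag points serve as the test set
        model.fit(X[idx], y[idx])
        losses[b] = np.mean((model.predict(X[oob]) - y[oob]) ** 2)
    # Monte Carlo standard error of the mean loss; the paper instead uses a
    # jackknife estimator for the standard error.
    return losses.mean(), losses.std(ddof=1) / np.sqrt(B)

# Toy usage with placeholder data (replace with real fingerprints/activities):
rng = np.random.default_rng(1)
X, y = rng.normal(size=(300, 64)), rng.normal(size=300)
mse, se = bootstrap_oos_mse(RandomForestRegressor(n_estimators=100), X, y, B=50)
print(f"expected OOS MSE: {mse:.3f} +/- {2 * se:.3f} (2 SE)")
```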
Fig. 2. Comparison of overall model performance for the standard bootstrap and the restricted activity bootstrap. All four panels show the overall model score (the sum of the probabilities of model optimality over the 25 datasets) as a function of the restriction on the activity levels in the training data; a restriction of 100% corresponds to standard cross-validation (random partitioning). The first three panels show the results for the two active-rank loss functions (one shown by thick lines, the other by dashed lines) with values of γ going from 0.9 (top left) to 0.99 (bottom left). The bottom-right panel shows the results when models are scored using mean squared error (dot-dashed lines). Red: deep learning; blue: support vector regression; orange: random forests; green: ridge regression.
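To make the "restricted activity" idea and the "probability of model optimality" score concrete, here is a hedged sketch under assumed conventions: training data are bootstrapped only from molecules at or below an activity quantile, the more active tail is held out for testing, and each model's score is the fraction of replicates in which it attains the lowest test loss. The function names, the exact resampling scheme, and the use of MSE (in place of the active-rank losses) are illustrative, not the paper's implementation.

```python
# Hedged sketch of a restricted activity bootstrap with a probability-of-
# optimality score. Illustrative only; the paper's resampling details and
# active-rank loss functions are not reproduced here.
import numpy as np

def restricted_split(y, restriction, rng):
    """Bootstrap a train set from the lowest `restriction` fraction of
    activities; reserve the more active tail for testing (restriction < 1)."""
    cutoff = np.quantile(y, restriction)
    train_pool = np.where(y <= cutoff)[0]
    test = np.where(y > cutoff)[0]
    train = rng.choice(train_pool, size=len(train_pool), replace=True)
    return train, test

def prob_optimal(models, X, y, restriction=0.8, B=100, seed=0):
    """Fraction of replicates in which each model has the lowest test MSE."""
    rng = np.random.default_rng(seed)
    wins = np.zeros(len(models))
    for _ in range(B):
        train, test = restricted_split(y, restriction, rng)
        losses = [np.mean((m.fit(X[train], y[train]).predict(X[test]) - y[test]) ** 2)
                  for m in models]
        wins[np.argmin(losses)] += 1
    return wins / B   # per-dataset probabilities; Fig. 2 sums these over datasets
```

With, for example, models = [RandomForestRegressor(), SVR(), Ridge()] from scikit-learn, summing prob_optimal over the 25 datasets would yield the kind of overall score plotted in Fig. 2.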
Fig. 3. Model comparison using the restricted activity bootstrap. Model expected out-of-sample loss is shown for each dataset, ordered from left to right by increasing dataset size. Error bars correspond to ±2 standard errors around the expected loss estimate, computed using the jackknife estimator. For each dataset, the optimal model is the one with the least expected loss.