Thomas M Kaiser, Pieter B Burger.
Abstract
Machine learning continues to make significant advances in predicting properties relevant to drug development. However, the efficacy of machine learning in this area depends on data that are both highly accurate and abundant. These two requirements are often considered together; however, understanding how much dataset error contemporary machine learning algorithms can tolerate may indicate whether non-bench sources of data can be used to build useful models where experimental data are scarce. We took highly accurate data for six kinases, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error into varying proportions of the dataset for each target. Using these perturbed datasets, we examined how the retrospective accuracy of a Naïve Bayes Network, a Random Forest, and a Probabilistic Neural Network decayed as a function of error. We also examined whether a training dataset with an error profile resembling that of the Free Energy Perturbation method (FEP+) could produce machine learning models with useful retrospective performance. The Naïve Bayes Network tolerated a high level of categorical error: on average, 39% error in the training set was required to lose predictivity on the test set. The Random Forest also tolerated a significant degree of categorical error, with an average of 29% required to lose predictivity. The Probabilistic Neural Network was less tolerant, losing predictivity at an average of 20% error. Finally, both the Naïve Bayes Network and the Random Forest could use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods with a known error distribution, such as FEP+, may be useful for generating machine learning models without relying on extensive and expensive in vitro-generated datasets.
Keywords: FEP; Naïve Bayes Network; Neural Network; Random Forest; anaplastic lymphoma kinase (ALK); cheminformatics; drug discovery; error; machine learning
Year: 2019 PMID: 31167452 PMCID: PMC6601015 DOI: 10.3390/molecules24112115
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Summary of the ten targets investigated.
| Target | Protein Family | Species | | Drug Stage |
|---|---|---|---|---|
| Anaplastic lymphoma kinase (ALK) | Receptor Tyr Kinase | | 1343 | Phase IV |
| Aurora B | Ser/Thr Kinase | | 1481 | Phase III |
| β-2 Adrenergic Receptor | GPCR | | 641 | Phase IV |
| c-Abl | Tyr Kinase | | 1439 | Phase IV |
| Factor Xa | Protease | | 1657 | Phase IV |
| HIV Protease | Protease | HIV | 2544 | Phase IV |
| JAK2 | Non-receptor Tyr Kinase | | 3624 | Phase IV |
| MEK1 | MAP Kinase Kinase | | 823 | Phase IV |
| PARP1 | Polymerase | | 1933 | Phase IV |
| TYRO3 | Receptor Tyr Kinase | | 277 | none |
Figure 1. Summary of the activity distribution (pIC50) for the ten targets investigated (B2AR: β-2 adrenergic receptor; HIVP: HIV protease).
Figure 2. Workflow for classification threshold evaluation in the Naïve Bayes Network (NBN) and Random Forest (RF) algorithms.
Figure 3. Workflow for classification threshold evaluation in the Probabilistic Neural Network (PNN) algorithm.
Figure 4. ROC for the NBN trained on the ALK dataset with a <20 nM classification.
Figure 5. Enrichment plot for the NBN trained on the ALK dataset with a <20 nM classification.
Summary of performance for the NBN, RF, and PNN on the ALK dataset.
| Model | ROC AUC | Top 10% Mean IC50 (nM) | Sensitivity | Precision | Enrichment Factor at 10% |
|---|---|---|---|---|---|
| ALK < 20 nM NBN | 0.917 | 6.70 | 0.891 | 0.678 | 2.717 |
| ALK < 20 nM RF | 0.913 | 3.30 | 0.739 | 0.739 | 2.935 |
| ALK < 20 nM PNN | 0.782 | 34.7 | 0.533 | 0.645 | 1.789 |
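The metrics in this table can be illustrated with a minimal sketch. The definitions below are the standard ones (rank-based ROC AUC, sensitivity as recall on actives, enrichment factor as the active rate in the top-scoring fraction divided by the overall active rate); the record does not state the exact formulas the authors used, so treat these as assumptions. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the probability that a random active outscores a random inactive."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between an active and an inactive count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_precision(predicted: Sequence[int], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, precision) for binary predictions against binary labels."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    return tp / (tp + fn), tp / (tp + fp)


def enrichment_factor(scores: Sequence[float], labels: Sequence[int], fraction: float = 0.10) -> float:
    """Active rate among the top-scoring fraction divided by the overall active rate."""
    n_top = max(1, round(fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate
```

Under this definition, an enrichment factor near 1 at 10% would mean the top-ranked decile is no richer in actives than the dataset as a whole.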
Figure 6. Workflow for error tolerance evaluation in the NBN and RF algorithms.
Figure 7. Workflow for error tolerance evaluation in the PNN algorithm.
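The error-tolerance workflows rest on deliberately mislabeling a chosen proportion of the training set. A minimal sketch of that step follows, assuming binary active/inactive labels and uniformly random selection of the molecules to mislabel. The 5% step size is an assumption inferred from the reported points of failure, which are all multiples of 5; `flip_labels` and `error_sweep` are hypothetical names, not the authors' code.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def flip_labels(labels: Sequence[int], error_percent: int, seed: int = 0) -> List[int]:
    """Return a copy of binary labels with error_percent of them flipped at random."""
    rng = random.Random(seed)
    n_flip = round(len(labels) * error_percent / 100)
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def error_sweep(labels: Sequence[int], step: int = 5, max_error: int = 50,
                seed: int = 0) -> Iterator[Tuple[int, List[int]]]:
    """Yield (error_percent, noisy_labels) for 0, step, 2*step, ... up to max_error."""
    for error_percent in range(0, max_error + 1, step):
        yield error_percent, flip_labels(labels, error_percent, seed)
```

Each noisy label set would then be used to retrain a model, which is scored against an unperturbed test set; the point of failure is the first error level at which predictivity is lost.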
Performance statistics for failed NBNs generated with the specified error in each split.
| Model | %Error Failure | ROC AUC | Top 10% Mean IC50 (nM) | Mean IC50 (nM) | Bottom 10% Mean IC50 (nM) | Fold Difference in Mean Top 10% IC50 |
|---|---|---|---|---|---|---|
| ALK < 20 nM NBN | Control | 0.917 | 6.70 | 4200 | 17,000 | 2500 |
| Split 1 Error | 45 | 0.674 | 1400 | 4200 | 4900 | 3.5 |
| Split 2 Error | 45 | 0.632 | 900 | 3300 | 8100 | 9.0 |
| Split 3 Error | 50 | 0.539 | 2400 | 4100 | 1100 | 0.46 |
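The fold-difference column is not defined in this record. The tabulated values are consistent with the bottom 10% mean IC50 divided by the top 10% mean IC50, and the sketch below checks that reading against the three error rows above. This is an inference from the numbers, not a definition taken from the source, and `fold_difference` is a hypothetical helper.

```python
def fold_difference(top10_mean_nm: float, bottom10_mean_nm: float) -> float:
    """Ratio of bottom 10% mean IC50 to top 10% mean IC50 (assumed definition)."""
    return bottom10_mean_nm / top10_mean_nm


# (top 10% mean IC50, bottom 10% mean IC50) in nM, copied from the table above.
rows = {
    "Split 1 Error": (1400, 4900),
    "Split 2 Error": (900, 8100),
    "Split 3 Error": (2400, 1100),
}

for name, (top, bottom) in rows.items():
    print(f"{name}: {fold_difference(top, bottom):.2g}")
```

Under this reading, a value below 1, as in Split 3, means the compounds ranked in the top 10% were on average weaker than those ranked in the bottom 10%.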
Performance statistics for NBNs with retained predictivity generated with the specified error in each split.
| Model | %Error Before Failure | ROC AUC | Top 10% Mean IC50 (nM) | Mean IC50 (nM) | Bottom 10% Mean IC50 (nM) | Fold Difference in Mean Top 10% IC50 |
|---|---|---|---|---|---|---|
| ALK < 20 nM NBN | Control | 0.917 | 6.70 | 4200 | 17,000 | 2500 |
| Split 1 Pre-failure | 40 | 0.742 | 53.5 | 4200 | 3100 | 58 |
| Split 2 Pre-failure | 40 | 0.747 | 275 | 3300 | 9200 | 33 |
| Split 3 Pre-failure | 45 | 0.703 | 130 | 4100 | 4500 | 35 |
Performance statistics for failed RFs generated with the specified error in each split.
| Model | %Error Failure | ROC AUC | Top 10% Mean IC50 (nM) | Mean IC50 (nM) | Bottom 10% Mean IC50 (nM) | Fold Difference in Mean Top 10% IC50 |
|---|---|---|---|---|---|---|
| ALK < 20 nM RF | Control | 0.913 | 3.33 | 4200 | 16,000 | 4800 |
| Split 1 Error | 30 | 0.762 | 1800 | 4200 | 7500 | 4.2 |
| Split 2 Error | 40 | 0.677 | 383 | 3300 | 8500 | 22 |
| Split 3 Error | 40 | 0.691 | 113 | 4100 | 5100 | 45 |
Performance statistics for RFs with retained predictivity generated with the specified error in each split.
| Model | %Error Before Failure | ROC AUC | Top 10% Mean IC50 (nM) | Mean IC50 (nM) | Bottom 10% Mean IC50 (nM) | Fold Difference in Mean Top 10% IC50 |
|---|---|---|---|---|---|---|
| ALK < 20 nM RF | Control | 0.913 | 3.33 | 4200 | 16,000 | 4800 |
| Split 1 Pre-failure | 25 | 0.828 | 362 | 4200 | 6400 | 18 |
| Split 2 Pre-failure | 35 | 0.739 | 111 | 3300 | 5600 | 50 |
| Split 3 Pre-failure | 35 | 0.746 | 19.5 | 4100 | 7600 | 390 |
Performance statistics for failed PNNs generated with the specified error in each split.
| Model | %Error Failure | ROC AUC | Top 10% Mean IC50 (nM) | Mean IC50 (nM) | Bottom 10% Mean IC50 (nM) | Fold Difference in Mean Top 10% IC50 |
|---|---|---|---|---|---|---|
| ALK < 20 nM PNN | Control | 0.782 | 34.7 | 4200 | 19,000 | 550 |
| Split 1 Error | 20 | 0.654 | 726 | 4200 | 12,000 | 17 |
| Split 2 Error | 30 | 0.615 | 287 | 3300 | 7600 | 26 |
| Split 3 Error | 15 | 0.635 | 1200 | 4100 | 3300 | 2.8 |
Performance statistics for PNNs with retained predictivity generated with the specified error in each split.
| Model | %Error Before Failure | ROC AUC | Top 10% Mean IC50 (nM) | Mean IC50 (nM) | Bottom 10% Mean IC50 (nM) | Fold Difference in Mean Top 10% IC50 |
|---|---|---|---|---|---|---|
| ALK < 20 nM PNN | Control | 0.782 | 34.7 | 4200 | 19,000 | 550 |
| Split 1 Pre-failure | 15 | 0.701 | 14.0 | 4200 | 2700 | 190 |
| Split 2 Pre-failure | 25 | 0.706 | 637 | 3300 | 5500 | 8.6 |
| Split 3 Pre-failure | 10 | 0.736 | 375 | 4100 | 7400 | 20 |
Summary of points of failure for each algorithm, random split and target.
| Target | Algorithm | Split 1 Percent Error of Failure | Split 2 Percent Error of Failure | Split 3 Percent Error of Failure | Mean Percent Error of Failure |
|---|---|---|---|---|---|
| ALK | NBN | 45 | 45 | 50 | 47 |
| | RF | 30 | 40 | 40 | 37 |
| | PNN | 20 | 30 | 50 | 33 |
| Aurora B | NBN | 30 | 30 | 35 | 32 |
| | RF | 20 | 20 | 25 | 22 |
| | PNN | - | - | - | - |
| β-2 | NBN | 50 | 50 | 50 | 50 |
| | RF | 35 | 25 | 45 | 35 |
| | PNN | 10 | 20 | 5 | 12 |
| c-Abl | NBN | 45 | 50 | 45 | 47 |
| | RF | 35 | 25 | 35 | 32 |
| | PNN | 30 | 25 | 15 | 23 |
| Factor Xa | NBN | 35 | 45 | 45 | 42 |
| | RF | 30 | 30 | 30 | 30 |
| | PNN | 20 | 10 | 30 | 20 |
| HIV Protease | NBN | 50 | 45 | 50 | 48 |
| | RF | 35 | 20 | 15 | 23 |
| | PNN | 25 | 15 | 25 | 22 |
| JAK2 | NBN | 40 | 30 | 40 * | 35 |
| | RF | 40 | 30 | 35 | 35 |
| | PNN | - | - | - | - |
| MEK1 | NBN | 40 | 30 | 40 | 37 |
| | RF | 30 | 35 | 25 | 30 |
| | PNN | 5 | 5 | 25 ** | 12 |
| PARP1 | NBN | 5 | 45 | 50 | 33 |
| | RF | 40 | 25 | 40 | 35 |
| | PNN | 25 ** | 5 | 25 ** | 18 |
| TYRO3 | NBN | 30 | 10 | 5 | 15 |
| | RF | 20 | 5 | 5 * | 10 |
| | PNN | - | - | - | - |
* failed in the control step, therefore another random split was used. ** reparameterization shifted the point of failure.
Average percent classification error that leads to failure.
| Model | Average %Error Threshold |
|---|---|
| NBN | 39 |
| RF | 29 |
| PNN | 20 |
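These averages can be reproduced from the per-target "Mean Percent Error of Failure" column of the preceding summary table. The sketch below assumes the average is taken over the per-target means and that targets reported as "-" for the PNN (Aurora B, JAK2, TYRO3) are skipped; the record does not state this, but it is the reading that matches the tabulated values.

```python
from statistics import mean

# Per-target mean percent error of failure, in table order:
# ALK, Aurora B, β-2, c-Abl, Factor Xa, HIV Protease, JAK2, MEK1, PARP1, TYRO3.
mean_failure = {
    "NBN": [47, 32, 50, 47, 42, 48, 35, 37, 33, 15],
    "RF": [37, 22, 35, 32, 30, 23, 35, 30, 35, 10],
    # Aurora B, JAK2, and TYRO3 are reported as "-" for the PNN and are omitted.
    "PNN": [33, 12, 23, 20, 22, 12, 18],
}

# Round each algorithm's average to the nearest whole percent.
averages = {model: round(mean(values)) for model, values in mean_failure.items()}
print(averages)
```

The unrounded averages are 38.6 (NBN), 28.9 (RF), and 20.0 (PNN).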
Performance statistics for NBNs with retained predictivity generated with 27% classification error in molecules between 109 nM and 3.7 nM.
| Model | ROC AUC | Top 10% Mean IC50 (nM) | Mean IC50 (nM) | Bottom 10% Mean IC50 (nM) | Fold Difference in Mean Top 10% IC50 |
|---|---|---|---|---|---|
| ALK < 20 nM NBN | 0.917 | 6.70 | 4200 | 17,000 | 2500 |
| Split 1 Error | 0.916 | 18.0 | 4200 | 17,000 | 940 |
| Split 2 Error | 0.897 | 22.5 | 3300 | 7000 | 310 |
| Split 3 Error | 0.887 | 99.8 | 4100 | 14,000 | 140 |
Performance statistics for RFs with retained predictivity generated with 27% classification error in molecules between 109 nM and 3.7 nM.
| Model | ROC AUC | Top 10% Mean IC50 (nM) | Mean IC50 (nM) | Bottom 10% Mean IC50 (nM) | Fold Difference in Mean Top 10% IC50 |
|---|---|---|---|---|---|
| ALK < 20 nM RF | 0.913 | 3.33 | 4200 | 16,000 | 4800 |
| Split 1 Error | 0.911 | 4.87 | 4200 | 17,000 | 3500 |
| Split 2 Error | 0.920 | 15.6 | 3300 | 12,000 | 770 |
| Split 3 Error | 0.930 | 9.74 | 4100 | 12,000 | 1200 |
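The FEP+-like perturbation in the two preceding tables confines the error to molecules near the classification threshold rather than spreading it over the whole training set. A minimal sketch of such a windowed perturbation follows, assuming a 20 nM activity threshold, inclusive window bounds of 3.7 nM and 109 nM, and uniform random choice of which in-window molecules to mislabel. The record gives only the 27% rate and the window, so the sampling details and the name `flip_in_window` are assumptions.

```python
import random
from typing import List, Sequence


def flip_in_window(ic50_nm: Sequence[float], threshold_nm: float = 20.0,
                   low_nm: float = 3.7, high_nm: float = 109.0,
                   error_percent: int = 27, seed: int = 0) -> List[int]:
    """Label molecules active (1) below the threshold, then flip a share of in-window labels."""
    rng = random.Random(seed)
    labels = [1 if value < threshold_nm else 0 for value in ic50_nm]
    # Only molecules whose IC50 falls inside the window are eligible for mislabeling.
    eligible = [i for i, value in enumerate(ic50_nm) if low_nm <= value <= high_nm]
    n_flip = round(len(eligible) * error_percent / 100)
    for i in rng.sample(eligible, n_flip):
        labels[i] = 1 - labels[i]
    return labels
```

Molecules far from the threshold keep their correct labels, so the overall error rate across the training set is lower than 27%.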