Samuel Boobier, Anne Osbourn, John B O Mitchell.
Abstract
In this study, we design and carry out a survey, asking human experts to predict the aqueous solubility of druglike organic compounds. We investigate whether these experts, drawn largely from the pharmaceutical industry and academia, can match or exceed the predictive power of algorithms. Alongside this, we implement 10 typical machine learning algorithms on the same dataset. The best algorithm, a variety of neural network known as a multi-layer perceptron, gave an RMSE of 0.985 log S units and an R2 of 0.706. We would not have predicted the relative success of this particular algorithm in advance. We found that the best individual human predictor generated an almost identical prediction quality with an RMSE of 0.942 log S units and an R2 of 0.723. The collection of algorithms contained a higher proportion of reasonably good predictors, nine out of ten compared with around half of the humans. We found that, for either humans or algorithms, combining individual predictions into a consensus predictor by taking their median generated excellent predictivity. While our consensus human predictor achieved very slightly better headline figures on various statistical measures, the difference between it and the consensus machine learning predictor was both small and statistically insignificant. We conclude that human experts can predict the aqueous solubility of druglike molecules essentially equally well as machine learning algorithms. We find that, for either humans or algorithms, combining individual predictions into a consensus predictor by taking their median is a powerful way of benefitting from the wisdom of crowds.
Year: 2017 PMID: 29238891 PMCID: PMC5729181 DOI: 10.1186/s13321-017-0250-y
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 8.489
Statistical measures of the performance of the 10 machine learning algorithms and the median-based machine learning consensus predictor
| Algorithm | RMSE | R2 | ρ | NC | AAE |
|---|---|---|---|---|---|
| MLP | 0.985 | 0.706 | 0.837 | 19 | 0.728 |
| RF | 1.165 | 0.583 | 0.736 | 20 | 0.802 |
| Bagging | 1.165 | 0.583 | 0.726 | 20 | 0.803 |
| KNN | 1.204 | 0.540 | 0.704 | 15 | 0.917 |
| ExtraTrees | 1.227 | 0.542 | 0.728 | 18 | 0.837 |
| AdaBoost | 1.235 | 0.545 | 0.708 | 19 | 0.851 |
| PLS | 1.265 | 0.507 | 0.670 | 15 | 0.980 |
| SVM | 1.280 | 0.520 | 0.694 | 16 | 0.925 |
| SGD | 1.429 | 0.577 | 0.752 | 11 | 1.185 |
| Decision tree | 1.813 | 0.260 | 0.530 | 17 | 1.198 |
| ML median | 1.140 | 0.601 | 0.762 | 18 | 0.778 |
We assessed each machine learning method in terms of the root mean squared error (RMSE), the coefficient of determination (R2, here the square of the Pearson correlation coefficient), the Spearman rank correlation coefficient (ρ), the number of correct predictions within a margin of one log S unit (NC), and the average absolute error (AAE).
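The five measures defined above can be computed directly from paired observed and predicted log S values. The following is a minimal sketch in plain Python, not the authors' code: the function names, the inclusive ± 1 margin used for NC, and the toy values in the usage lines are assumptions made for illustration.

```python
import math


def _pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def _ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def solubility_metrics(observed, predicted, margin=1.0):
    """Return RMSE, R2, Spearman rho, NC and AAE for log S predictions."""
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        # R2 taken as the squared Pearson correlation, as in the text.
        "R2": _pearson(observed, predicted) ** 2,
        # Spearman rho is the Pearson correlation of the ranks.
        "rho": _pearson(_ranks(observed), _ranks(predicted)),
        # Assumption: a prediction exactly one unit off still counts.
        "NC": sum(1 for e in errors if abs(e) <= margin),
        "AAE": sum(abs(e) for e in errors) / n,
    }


# Hypothetical toy data, not taken from the study.
toy_observed = [-3.0, -2.0, -1.0, 0.0]
toy_predicted = [-2.5, -2.5, 0.5, 0.0]
toy_metrics = solubility_metrics(toy_observed, toy_predicted)
```

Defining R2 as a squared correlation means it is insensitive to a constant offset or scale error in the predictions, which is one reason to read it alongside RMSE and AAE rather than on its own.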
Comparison of statistical measures of the performance of the median-based machine learning consensus predictor and the median-based human consensus predictor in terms of the root mean squared error (RMSE), the coefficient of determination (R2, here the square of the Pearson correlation coefficient), the Spearman rank correlation coefficient (ρ), the number of correct predictions within a margin of one log S unit (NC), and the average absolute error (AAE)
| Measure | Median-based ML | Median-based human |
|---|---|---|
| RMSE | 1.140 | 1.087 |
| R2 | 0.601 | 0.632 |
| ρ | 0.762 | 0.817 |
| NC | 18 | 21 |
| AAE | 0.778 | 0.732 |
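A median-based consensus predictor of the kind compared here takes, for each compound, the median of the individual predictions. The sketch below illustrates the idea under stated assumptions: the predictor names and values are hypothetical, and the handling of an even number of predictors (Python's `statistics.median` averages the two middle values) is an assumption, since the text here does not say how that case was treated.

```python
from statistics import median


def median_consensus(predictions_by_predictor):
    """Combine per-predictor log S predictions into one consensus list.

    predictions_by_predictor maps a predictor name to its predictions,
    with every list in the same compound order.
    """
    per_compound = zip(*predictions_by_predictor.values())
    return [median(values) for values in per_compound]


# Hypothetical predictions for two compounds from three predictors.
toy_predictions = {
    "predictor_a": [-3.0, -1.0],
    "predictor_b": [-3.5, -2.0],
    "predictor_c": [-2.0, -1.5],
}
toy_consensus = median_consensus(toy_predictions)
```

Because the median ignores how far the extreme predictions lie from the centre, a single badly wrong predictor has little influence on the consensus value.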
Performance of the median-based consensus predictors. Errors are absolute (unsigned) and are measured in log S units
| Compound | ML error | Human error | Difference |
|---|---|---|---|
| 4-Aminobenzoic acid | 0.07 | 0.13 | − 0.06 |
| 4-Aminosalicylic acid | 0.23 | 0.76 | − 0.53 |
| Antipyrine | 3.73 | 2.98 | 0.75 |
| Chloramphenicol | 0.35 | 0.39 | − 0.04 |
| Corticosterone | 0.11 | 0.06 | 0.05 |
| Dapsone | 0.54 | 0.29 | 0.25 |
| Primidone | 0.06 | 0.14 | − 0.08 |
| Estrone | 0.87 | 0.82 | 0.05 |
| Alclofenac | 0.30 | 0.12 | 0.18 |
| 5-Fluorouracil | 0.46 | 0.62 | − 0.16 |
| Griseofulvin | 0.44 | 0.25 | 0.19 |
| Fluometuron | 0.53 | 0.04 | 0.49 |
| Fluconazole | 1.09 | 0.70 | 0.39 |
| Khellin | 0.17 | 0.98 | − 0.81 |
| Clozapine | 1.37 | 0.71 | 0.66 |
| Norethisterone | 0.63 | 0.63 | 0.00 |
| Nicotinic acid | 0.58 | 0.35 | 0.23 |
| Perphenazine | 0.16 | 0.16 | 0.00 |
| Pteridine | 2.22 | 3.02 | − 0.80 |
| Salicylamide | 0.23 | 0.49 | − 0.26 |
| Sulfanilamide | 0.54 | 0.14 | 0.40 |
| Gliclazide | 1.03 | 0.80 | 0.23 |
| Trihexyphenidyl | 1.98 | 1.45 | 0.53 |
| Triphenylene | 0.15 | 0.27 | − 0.12 |
| Mifepristone | 1.57 | 2.00 | − 0.43 |
| Average | 0.778 | 0.732 | 0.046 |
The difference (ML error minus human error) is signed: it is positive where the human median-based predictor performed better on that compound and negative where the machine learning median-based predictor performed better.
Comparison of statistical measures of the performance of the best single machine learning predictor and the best individual human predictor
| Measure | Multi-layer perceptron | Human participant 11 |
|---|---|---|
| RMSE | 0.985 | 0.942 |
| R2 | 0.706 | 0.723 |
| Spearman ρ | 0.837 | 0.853 |
| Number correct | 19 | 18 |
| AAE | 0.728 | 0.734 |
Performance of the best individual predictors. Errors are absolute (unsigned) and are measured in log S units
| Compound | MLP error | Human 11 error | Difference |
|---|---|---|---|
| 4-Aminobenzoic acid | 0.42 | 0.63 | − 0.21 |
| 4-Aminosalicylic acid | 0.39 | 0.04 | 0.35 |
| Antipyrine | 1.90 | 1.48 | 0.42 |
| Chloramphenicol | 0.78 | 0.89 | − 0.11 |
| Corticosterone | 0.00 | 0.76 | − 0.76 |
| Dapsone | 0.41 | 0.09 | 0.32 |
| Primidone | 1.45 | 0.36 | 1.09 |
| Estrone | 0.78 | 1.32 | − 0.54 |
| Alclofenac | 0.02 | 1.13 | − 1.11 |
| 5-Fluorouracil | 0.07 | 0.97 | − 0.90 |
| Griseofulvin | 0.90 | 1.25 | − 0.35 |
| Fluometuron | 0.33 | 0.46 | − 0.13 |
| Fluconazole | 0.22 | 0.20 | 0.02 |
| Khellin | 0.13 | 0.02 | 0.11 |
| Clozapine | 0.33 | 0.76 | − 0.43 |
| Norethisterone | 1.53 | 0.37 | 1.16 |
| Nicotinic acid | 0.59 | 0.15 | 0.44 |
| Perphenazine | 0.53 | 0.84 | − 0.31 |
| Pteridine | 1.00 | 0.02 | 0.98 |
| Salicylamide | 0.23 | 1.34 | − 1.11 |
| Sulfanilamide | 0.71 | 0.14 | 0.57 |
| Gliclazide | 1.30 | 0.29 | 1.01 |
| Trihexyphenidyl | 2.93 | 2.20 | 0.73 |
| Triphenylene | 0.38 | 0.73 | − 0.35 |
| Mifepristone | 0.86 | 1.90 | − 1.04 |
| Average | 0.728 | 0.734 | − 0.005 |
The difference (MLP error minus human error) is signed: it is positive where the best human predictor performed better on that compound and negative where the best machine learning predictor performed better.
The 25 test set compounds ranked by the average of the absolute prediction errors of the two consensus predictors (mean absolute median error, MAME)
| Compound | Log S | ML median | ML error | Human median | Human error | MAME | NC |
|---|---|---|---|---|---|---|---|
| Corticosterone | − 3.24 | − 3.13 | 0.11 | − 3.30 | − 0.06 | 0.09 | 22 |
| 4-Aminobenzoic acid | − 1.37 | − 1.44 | − 0.07 | − 1.50 | − 0.13 | 0.10 | 26 |
| Primidone | − 2.64 | − 2.70 | − 0.06 | − 2.50 | 0.14 | 0.10 | 23 |
| Perphenazine | − 4.16 | − 4.32 | − 0.16 | − 4.00 | 0.16 | 0.16 | 16 |
| Alclofenac | − 3.13 | − 2.83 | 0.30 | − 3.25 | − 0.12 | 0.21 | 18 |
| Triphenylene | − 6.73 | − 6.58 | 0.15 | − 7.00 | − 0.27 | 0.21 | 19 |
| Fluometuron | − 3.46 | − 2.93 | 0.53 | − 3.50 | − 0.04 | 0.29 | 19 |
| Sulfanilamide | − 1.36 | − 1.90 | − 0.54 | − 1.50 | − 0.14 | 0.34 | 23 |
| Griseofulvin | − 3.25 | − 2.81 | 0.44 | − 3.00 | 0.25 | 0.35 | 15 |
| Salicylamide | − 1.84 | − 1.61 | 0.23 | − 1.35 | 0.49 | 0.36 | 20 |
| Chloramphenicol | − 2.11 | − 2.46 | − 0.35 | − 2.50 | − 0.39 | 0.37 | 20 |
| Dapsone | − 3.09 | − 3.63 | − 0.54 | − 2.80 | 0.29 | 0.42 | 18 |
| Nicotinic acid | − 0.85 | − 1.43 | − 0.58 | − 1.20 | − 0.35 | 0.47 | 20 |
| 4-Aminosalicylic acid | − 1.96 | − 1.73 | 0.23 | − 1.20 | 0.76 | 0.49 | 21 |
| 5-Fluorouracil | − 1.03 | − 1.49 | − 0.46 | − 1.65 | − 0.62 | 0.54 | 23 |
| Khellin | − 3.02 | − 3.19 | − 0.17 | − 4.00 | − 0.98 | 0.58 | 18 |
| Norethisterone | − 4.63 | − 4.00 | 0.63 | − 4.00 | 0.63 | 0.63 | 15 |
| Estrone | − 5.32 | − 4.45 | 0.87 | − 4.50 | 0.82 | 0.85 | 14 |
| Fluconazole | − 1.80 | − 2.89 | − 1.09 | − 2.50 | − 0.70 | 0.90 | 15 |
| Gliclazide | − 4.29 | − 3.26 | 1.03 | − 3.49 | 0.80 | 0.91 | 11 |
| Clozapine | − 3.24 | − 4.61 | − 1.37 | − 3.95 | − 0.71 | 1.04 | 13 |
| Trihexyphenidyl | − 5.20 | − 3.22 | 1.98 | − 3.75 | 1.45 | 1.72 | 6 |
| Mifepristone | − 5.90 | − 4.33 | 1.57 | − 3.90 | 2.00 | 1.79 | 4 |
| Pteridine | 0.02 | − 2.20 | − 2.22 | − 3.00 | − 3.02 | 2.62 | 2 |
| Antipyrine | 0.48 | − 3.25 | − 3.73 | − 2.50 | − 2.98 | 3.35 | 0 |
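The MAME column is the mean of the absolute values of the two consensus errors for a compound, and the table is sorted by it in ascending order. A minimal sketch of that arithmetic follows, using four rows of signed errors copied from the table above; the function and variable names are illustrative. Recomputing from the rounded errors shown can differ from the tabulated MAME in the last digit (Gliclazide gives 0.915 against a tabulated 0.91), presumably because the table was built from unrounded values.

```python
def mame(ml_error, human_error):
    """Mean absolute median error: average of the two absolute errors."""
    return (abs(ml_error) + abs(human_error)) / 2


# (compound, signed ML median error, signed human median error),
# copied from the table above.
rows = [
    ("Clozapine", -1.37, -0.71),
    ("Pteridine", -2.22, -3.02),
    ("Norethisterone", 0.63, 0.63),
    ("Perphenazine", -0.16, 0.16),
]

# Rank compounds from easiest to hardest for the two consensus predictors.
ranked = sorted(rows, key=lambda row: mame(row[1], row[2]))
ranked_names = [name for name, _, _ in ranked]
```

Taking absolute values before averaging matters: for Perphenazine the signed errors cancel to zero, whereas the MAME of 0.16 reflects that both consensus predictors were off by that amount.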
Fig. 1 Per-compound distribution of the average of the absolute prediction errors of the two consensus predictors (Mean Absolute Median Error, MAME) for the test set. Compounds with errors more than one standard deviation above or below the mean signed error are in orange, those more than two standard deviations away are in red
Fig. 2 Magnitude of the average of the absolute prediction errors of the two consensus predictors (Mean Absolute Median Error, MAME) plotted against experimental log S for the 25 compounds in the test set
Fig. 3 Number of correct predictions within ± 1 log S unit recorded for each molecule from 27 predictors (10 machine learning, 17 human), plotted against experimental log S