Alexander L Perryman, Daigo Inoyama, Jimmy S Patel, Sean Ekins, Joel S Freundlich.
Abstract
Solubility is a key metric for therapeutic compounds. Conversely, insoluble compounds cloud the accuracy of assays at all stages of chemical biology and drug discovery. Herein, we disclose naïve Bayesian classifier models to predict aqueous solubility. Publicly accessible aqueous solubility data were used to create two full, or nonpruned, training sets. These two sets were also combined to create a full fused set, and a training set comprised of a literature collation of solubility data was also considered as a reference. We tested different extents of data pruning on the training sets and constructed machine learning models that were evaluated with two independent, external test sets that contained compounds that were different from the training sets. The best pruned and fused model was significantly more accurate, in comparison to either the full model or the full fused model, with the prediction of these external test sets. By carefully removing data from the training set, less information can be used to create more accurate machine learning models for aqueous solubility. This knowledge and the curated training sets should prove useful to future machine learning approaches.
Year: 2020 PMID: 32685821 PMCID: PMC7364544 DOI: 10.1021/acsomega.0c01251
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Internal Statistics and External Statistics for Select Models
| Bayesian model | ROC score | sensitivity | specificity | concordance | MCC | Cohen’s kappa | F1 score |
|---|---|---|---|---|---|---|---|
| Internal set (with five-fold cross validation) | | | | | | | |
| Full MLSMR | 0.836 | 77.8 | 81.5 | 79.5 | n/a | n/a | n/a |
| Full AZ | 0.744 | 86.7 | 76.8 | 80.3 | n/a | n/a | n/a |
| Full Fused AZ + MLSMR | 0.833 | 76.6 | 82.3 | 79.2 | n/a | n/a | n/a |
| Pruned and Fused AZ + MLSMR | 0.941 | 93.9 | 86.7 | 86.9 | n/a | n/a | n/a |
| Huuskonen | 0.926 | 89.9 | 94.1 | 90.9 | n/a | n/a | n/a |
| External PubChem set | | | | | | | |
| Full MLSMR | 0.534 | 37.1 | 45.7 | 44.2 | –0.13 | –0.10 | 0.19 |
| Full AZ | 0.761 | 34.3 | 32.7 | 33.0 | –0.26 | –0.17 | 0.15 |
| Full Fused AZ + MLSMR | 0.687 | 34.3 | 42.6 | 41.1 | –0.18 | –0.13 | 0.17 |
| Pruned and Fused AZ + MLSMR | 0.824 | 85.7 | 65.4 | 69.0 | 0.39 | 0.33 | 0.50 |
| Huuskonen | 0.630 | 25.7 | 59.9 | 53.8 | –0.11 | –0.10 | 0.17 |
| External JSF set | | | | | | | |
| Full MLSMR | 0.756 | 35.7 | 89.5 | 75.0 | 0.30 | 0.28 | 0.43 |
| Full AZ | 0.714 | 78.6 | 60.5 | 65.4 | 0.35 | 0.31 | 0.55 |
| Full Fused AZ + MLSMR | 0.842 | 35.7 | 89.5 | 75.0 | 0.30 | 0.28 | 0.43 |
| Pruned and Fused AZ + MLSMR | 0.724 | 64.3 | 86.8 | 80.8 | 0.51 | 0.51 | 0.64 |
| Huuskonen | 0.835 | 78.6 | 68.4 | 71.2 | 0.42 | 0.39 | 0.59 |
The Full MLSMR model was constructed from the training set composed of 57 824 unique compounds, of which 54.7% were defined as soluble using our ≥100 μM criterion.
The Full AstraZeneca (or AZ) model was built from the training set composed of 1763 compounds, of which 35.1% were soluble.
The Full Fused MLSMR + AZ model was generated by combining all of the compounds from the Full MLSMR set with the Full AstraZeneca set. It had 59 510 compounds, of which 54.2% were defined as soluble.
The Pruned and Fused MLSMR + AZ model corresponds to the training set in which we combined a pruned version of the AZ set (compounds with a solubility of 25–99 μM were deleted) with a pruned version of the MLSMR set (only the subset of compounds with a solubility <25 μM were included). This training set contains 17 460 compounds, of which 3.5% were soluble.
The Huuskonen reference model was constructed using a concatenation of previously published training and test sets.[7] It has 1290 compounds, of which 76.5% are soluble.
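The pruning rules described for the Pruned and Fused set can be illustrated with a minimal Python sketch. It is not the authors' code. The `Compound` record, its field names, and the function names are hypothetical, and the sketch assumes each compound carries a single numeric solubility in μM. It also treats the deleted 25–99 μM band as the half-open interval [25, 100), which is an assumption about how boundary values were handled.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

SOLUBLE_THRESHOLD_UM = 100.0  # from the source: soluble means >= 100 uM
PRUNE_LOWER_UM = 25.0         # from the source: 25-99 uM was deleted from AZ


@dataclass(frozen=True)
class Compound:
    """Hypothetical record; the field names are illustrative only."""
    compound_id: str
    source: str            # "AZ" or "MLSMR"
    solubility_um: float   # assumed: one numeric value per compound


def is_soluble(compound: Compound) -> bool:
    """Binary label using the >= 100 uM criterion."""
    return compound.solubility_um >= SOLUBLE_THRESHOLD_UM


def keep_after_pruning(compound: Compound) -> bool:
    """Apply the per-source pruning rule described in the text."""
    if compound.source == "AZ":
        # Drop the ambiguous middle band, keep clear insolubles and solubles.
        return (compound.solubility_um < PRUNE_LOWER_UM
                or compound.solubility_um >= SOLUBLE_THRESHOLD_UM)
    if compound.source == "MLSMR":
        # Keep only the clearly insoluble subset.
        return compound.solubility_um < PRUNE_LOWER_UM
    raise ValueError(f"unknown source: {compound.source!r}")


def build_pruned_fused_set(
    compounds: Iterable[Compound],
) -> List[Tuple[str, bool]]:
    """Return (compound_id, soluble_label) pairs for the fused, pruned set."""
    return [(c.compound_id, is_soluble(c))
            for c in compounds if keep_after_pruning(c)]


example = [
    Compound("a", "AZ", 10.0),
    Compound("b", "AZ", 50.0),      # removed: inside the 25-99 uM band
    Compound("c", "AZ", 150.0),
    Compound("d", "MLSMR", 10.0),
    Compound("e", "MLSMR", 50.0),   # removed: not < 25 uM
    Compound("f", "MLSMR", 150.0),  # removed: not < 25 uM
]
pruned = build_pruned_fused_set(example)
```

Under these rules every soluble compound in the fused set comes from the AZ data, which is consistent with the small soluble fraction reported for this training set. Deduplication across the two sources is not shown.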
The ROC score is the area under the receiver operating characteristic curve from a five-fold cross-validation study.
Sensitivity represents the percentage of correctly identified soluble compounds (true positives).
Specificity signifies the percentage of correctly identified insoluble compounds (true negatives).
Concordance corresponds to the overall accuracy: percentage of (true positives + true negatives)/total number of compounds.
The External PubChem set has 197 unique compounds (that are not part of any training set), of which 17.8% are soluble.
The External JSF set has 52 compounds, of which 26.9% are soluble.
The Matthews Correlation Coefficient (MCC), also referred to as the phi coefficient, is a chance-corrected statistic where MCC = 1 indicates perfect agreement, MCC = −1 indicates total disagreement, and MCC = 0 indicates that the model is no better than random.
Cohen’s Kappa is a chance-corrected statistic that uses a different method to calculate the random likelihood of making correct predictions for the external set. Kappa <0 indicates no agreement, Kappa of 0–0.2 indicates slight agreement, Kappa of 0.21–0.40 is fairly predictive, and Kappa of 0.41–0.60 indicates moderate agreement.[30]
The F1 score is the harmonic mean of precision (i.e., the positive predictive value, or hit rate) and sensitivity. The best F1 score is 1, while the worst score possible is 0.
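The statistics defined above can all be derived from a single confusion matrix, as in the following Python sketch. The function name and the dictionary keys are illustrative. The example counts are not given in the source: they are back-calculated, as an assumption, from the External PubChem row for the Pruned and Fused AZ + MLSMR model (197 compounds, 17.8% soluble, 85.7% sensitivity, 65.4% specificity).

```python
import math
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute the table's statistics from confusion-matrix counts.

    "Positive" means soluble, "negative" means insoluble.
    """
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)       # soluble compounds correctly identified
    specificity = tn / (tn + fp)       # insoluble compounds correctly identified
    concordance = (tp + tn) / total    # overall accuracy

    # Matthews correlation coefficient (phi coefficient).
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0

    # Cohen's kappa: observed accuracy corrected for the accuracy expected
    # by chance, given the marginal totals of predictions and true labels.
    expected = (
        (tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)
    ) / total ** 2
    kappa = (concordance - expected) / (1 - expected) if expected < 1 else 0.0

    # F1: harmonic mean of precision and sensitivity.
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    return {
        "sensitivity_pct": 100 * sensitivity,
        "specificity_pct": 100 * specificity,
        "concordance_pct": 100 * concordance,
        "mcc": mcc,
        "kappa": kappa,
        "f1": f1,
    }


# Assumed counts (inferred, not reported): 35 soluble and 162 insoluble
# compounds, with 30 true positives and 106 true negatives.
metrics = classification_metrics(tp=30, fn=5, tn=106, fp=56)
```

With these inferred counts, the rounded results are 85.7% sensitivity, 65.4% specificity, 69.0% concordance, MCC 0.39, kappa 0.33, and F1 0.50, which agree with the corresponding table row.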
Figure 1. ROC curves for evaluation of the solubility Bayesian models with the (A) External PubChem and (B) External JSF sets.
Figure 2. Principal component analysis (PCA) comparing the Pruned and Fused AZ + MLSMR training set (in red) to the Huuskonen reference training set (in blue). (A) shows all compounds with the same radii, while (B) has insoluble compounds displayed with smaller radii.