Min Wei, Xudong Zhang, Xiaolin Pan, Bo Wang, Changge Ji, Yifei Qi, John Z H Zhang.
Abstract
Human oral bioavailability (HOB) is a key factor in determining the fate of new drugs in clinical trials. HOB is conventionally measured with expensive and time-consuming experimental tests, so computational models that evaluate HOB before a new drug is synthesized would benefit the drug development process. In this study, a total of 1588 drug molecules with HOB data were collected from the literature to develop a classification model that uses the consensus predictions of five random forest models. The consensus model shows excellent prediction accuracy on two independent test sets at two classification cutoffs, 20% and 50%. Analysis of the importance of the input variables identified the main molecular descriptors that affect the HOB class value. The model is available as a web server at www.icdrug.com/ICDrug/ADMET for quick assessment of the oral bioavailability of small molecules. The results of this study provide an accurate and easy-to-use tool for screening drug candidates based on HOB, which may be used to reduce the risk of failure in the late stages of drug development.
Keywords: ADMET; Classification; Molecular descriptors; Oral bioavailability; Prediction
Year: 2022 PMID: 34991690 PMCID: PMC8740492 DOI: 10.1186/s13321-021-00580-6
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Information of the training and test datasets
| Cutoff | Data set | Molecules | Positive | Negative |
|---|---|---|---|---|
| F = 50% | Training set | 1157 | 536 | 621 |
| | Test set 1 | 290 | 169 | 121 |
| | Test set 2 | 141 | 90 | 51 |
| F = 20% | Training set | 1142 | 859 | 283 |
| | Test set 1 | 287 | 214 | 73 |
| | Test set 2 | 133 | 128 | 5 |
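The two cutoffs turn a continuous bioavailability value into a binary class. The sketch below illustrates that step under one stated assumption: a molecule counts as positive when its HOB is at or above the cutoff (the record does not state how ties are handled). The function name `label_by_cutoff` and the sample values are hypothetical and not taken from the data set.

```python
from collections import Counter


def label_by_cutoff(hob_percent, cutoff):
    """Return 1 (positive) when HOB >= cutoff, else 0 (negative).

    Assumption: values exactly at the cutoff are treated as positive.
    """
    return 1 if hob_percent >= cutoff else 0


# Hypothetical bioavailability values in percent (illustration only).
hob_values = [5.0, 18.0, 20.0, 35.0, 50.0, 72.0, 95.0]

for cutoff in (20.0, 50.0):
    labels = [label_by_cutoff(v, cutoff) for v in hob_values]
    counts = Counter(labels)
    print(f"F = {cutoff:.0f}%: positive={counts[1]}, negative={counts[0]}")
```

A lower cutoff moves more molecules into the positive class, which matches the stronger class imbalance seen in the F = 20% rows of the table.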
Fig. 1 Maximum similarity of each molecule in the test sets to the molecules in the training set. The similarity was calculated as the Tanimoto coefficient on the topological fingerprints of the molecules
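The quantity plotted in Fig. 1 can be sketched as follows. This is a minimal illustration that represents each fingerprint as a set of on-bit indices. The fingerprints below are hypothetical, and a real workflow would generate topological fingerprints with a cheminformatics toolkit instead.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def max_similarity_to_training(test_fps, train_fps):
    """For each test fingerprint, return its highest similarity to any training fingerprint."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


# Hypothetical fingerprints (sets of on-bit positions).
train_fps = [{1, 4, 7, 9}, {2, 4, 8}, {1, 2, 3, 5}]
test_fps = [{1, 4, 7}, {6, 10}]

print(max_similarity_to_training(test_fps, train_fps))  # [0.75, 0.0]
```

A test molecule with a low maximum similarity lies far from the training set, so its prediction is more of an extrapolation.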
Optimized parameters of the RF models from fivefold cross-validation on the training set when the cutoff is 50%
| Parameters | Parameters meaning | Optimal value |
|---|---|---|
| n_estimators | The number of trees in the forest | 31 |
| min_samples_leaf | The minimum number of samples required to be at a leaf node | 6 |
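The fivefold cross-validation used to select these parameters partitions the training set into five validation folds. The sketch below shows only the index bookkeeping, using the 1157 training molecules reported for the 50% cutoff. The helper `kfold_indices` is hypothetical, and the absence of shuffling or stratification is a simplification, since the record does not describe how the folds were drawn.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first folds
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# 1157 is the training-set size listed for the F = 50% cutoff.
for fold, (train, validation) in enumerate(kfold_indices(1157), start=1):
    print(f"fold {fold}: train={len(train)}, validation={len(validation)}")
```

Each candidate value of `n_estimators` and `min_samples_leaf` would be scored on the five validation folds, and the best-scoring combination kept.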
Performance of the RF models on the two test sets when the cutoff is 50%
| Data set | Model | SE | SP | ACC | AUC | MCC | F1-score |
|---|---|---|---|---|---|---|---|
| Test set 1 | Model 1 | 0.779 | 0.732 | 0.752 | 0.819 | 0.505 | 0.774 |
| | Model 2 | 0.746 | 0.786 | 0.769 | 0.826 | 0.529 | 0.798 |
| | Model 3 | 0.713 | 0.750 | 0.734 | 0.790 | 0.460 | 0.766 |
| | Model 4 | 0.745 | 0.732 | 0.738 | 0.801 | 0.473 | 0.764 |
| | Model 5 | 0.713 | 0.768 | 0.745 | 0.813 | 0.479 | 0.777 |
| | Consensus model | | | | | | |
| Test set 2 | Model 1 | 0.673 | 0 | 0.816 | 0.839 | 0.596 | 0.860 |
| | Model 2 | 0.673 | 0.876 | 0.801 | | 0.565 | 0.848 |
| | Model 3 | 0.577 | 0.854 | 0.752 | 0.840 | 0.452 | 0.813 |
| | Model 4 | | 0.865 | 0.801 | 0.857 | 0.568 | 0.846 |
| | Model 5 | 0.673 | 0.831 | 0.773 | 0.862 | 0.509 | 0.822 |
| | Consensus model | | | | 0.872 | | |
Bold numbers refer to the maximum value (optimum value) obtained from the corresponding evaluation index
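The evaluation indices in the table (SE, SP, ACC, MCC and F1-score) all derive from the four cells of a confusion matrix. The sketch below uses the standard definitions. The counts are hypothetical and chosen only to make the arithmetic easy to follow, and the function assumes no row or column of the confusion matrix is empty.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute SE, SP, ACC, MCC and F1 from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity (recall of the positive class)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * se / (precision + se)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom        # Matthews correlation coefficient
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc, "F1": f1}


# Hypothetical counts (illustration only).
metrics = classification_metrics(tp=40, fn=10, tn=45, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC is the most informative of these on imbalanced test sets, because it uses all four cells rather than only the correctly classified ones.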
Fig. 2 The AUCs of the RF models on the two test sets when the cutoff is 50%
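The AUC reported for each model equals the probability that a randomly chosen positive molecule receives a higher score than a randomly chosen negative one. The sketch below computes it with that pairwise definition, counting ties as one half. The labels and scores are hypothetical, and the quadratic loop is chosen for readability rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical true classes and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(roc_auc(labels, scores), 3))  # 0.889
```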
Comparison with other prediction models on test set 1 when the cutoff is 50%
| Model | Data set size | Method | ACC (test) | AUC (test) | Cut-off value |
|---|---|---|---|---|---|
| Current study | 1588 | RF | 0.793 | 0.830 | F = 50% |
| Falcón-Cano et al. | 1448 | CART, MLP, NB, GBT, SVM | 0.783ᵃ | 0.800ᵃ | F = 50% |
| admetSAR | 995 | RF | 0.697ᵃ | 0.752ᵃ | logK(%F)ᵇ = 0 (F = 50%) |
| Kim et al. | 995 | RF, SVM-consensus CTG | 0.76ᵃ | NA | F = 50% |
CART: classification and regression trees; MLP: multilayer perceptron; NB: naive Bayes; GBT: gradient boosted trees; SVM: support vector machines
ᵃ Taken from the respective references
b
Comparison with admetSAR on test sets 1 and 2 when the cutoff is 50%
| Data set | Model | SE | SP | ACC | AUC |
|---|---|---|---|---|---|
| Test set 1ᵃ | Current study | | | | |
| | admetSAR | 0.784 | 0.777 | 0.780 | 0.831 |
| Test set 2ᵃ | Current study | 0.692 | | | |
| | admetSAR | 0.725 | 0.742 | 0.787 | |
Bold numbers refer to the maximum value (optimum value) obtained from the corresponding evaluation index
ᵃ Test sets 1 and 2 contain 168 and 66 molecules, respectively, after removing the molecules used in the admetSAR training set
Optimized parameters of the RF models from fivefold cross-validation on the training set when the cutoff is 20%
| Parameters | Parameters meaning | Optimal value |
|---|---|---|
| n_estimators | The number of trees in the forest | 10 |
| min_samples_leaf | The minimum number of samples required to be at a leaf node | 6 |
Performance of the RF models on the two test sets when the cutoff is 20%
| Data set | Model | SE | SP | ACC | AUC | MCC | F1-score |
|---|---|---|---|---|---|---|---|
| Test set 1 (287) | Model 1 | 0.370 | 0.869 | 0.742 | 0.736 | 0.264 | 0.824 |
| | Model 2 | 0.479 | 0.864 | 0.767 | 0.767 | 0.360 | 0.847 |
| | Model 3 | 0.452 | 0.893 | 0.780 | 0.771 | 0.379 | 0.858 |
| | Model 4 | 0.452 | 0.856 | 0.746 | 0.759 | 0.308 | 0.832 |
| | Model 5 | 0.493 | 0.855 | 0.763 | 0.789 | 0.359 | 0.833 |
| | Consensus model | | | | | | |
| Test set 2 (133) | Model 1 | 0.8 | 0.891 | 0.887 | 0.973 | 0.384 | 0.938 |
| | Model 2 | 1 | 0.859 | 0.865 | 0.978 | 0.432 | 0.924 |
| | Model 3 | 0.8 | 0.883 | 0.880 | 0.939 | 0.371 | 0.933 |
| | Model 4 | 1 | 0.917 | 0.955 | 0.534 | | |
| | Model 5 | 1 | 0.898 | 0.902 | 0.973 | 0.499 | 0.946 |
| | Consensus model | 1 | 0.906 | 0.951 | | | |
Bold numbers refer to the maximum value (optimum value) obtained from the corresponding evaluation index
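The consensus rows combine the outputs of the five RF models into one prediction. The record does not state the combination rule, so the sketch below assumes one common choice: average the five predicted probabilities and apply a 0.5 decision threshold. The function name, the threshold and the probabilities are all hypothetical.

```python
from statistics import mean


def consensus_predict(model_probabilities, threshold=0.5):
    """Average per-model probabilities and threshold the mean.

    Assumption: simple probability averaging; the published model may
    combine its five random forests differently.
    """
    average = mean(model_probabilities)
    return average, int(average >= threshold)


# Hypothetical positive-class probabilities from five models for one molecule.
probabilities = [0.62, 0.48, 0.55, 0.41, 0.58]
average, label = consensus_predict(probabilities)
print(f"mean probability = {average:.3f}, predicted class = {label}")
```

Averaging smooths out disagreement between individual models: here two of the five models fall below 0.5, yet the mean stays above the threshold.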
Comparison with ADMETlab on test sets 1 and 2 when the cutoff is 20%
| Data set | Model | SE | SP | ACC | AUC |
|---|---|---|---|---|---|
| Test set 1 | Current study | 0.493 | | 0.815 | 0.801 |
| | ADMETlab | | 0.855 | | |
| Test set 2 | Current study | | | | |
| | ADMETlab | 0.8 | 0.844 | 0.842 | 0.902 |
Bold numbers refer to the maximum value (optimum value) obtained from the corresponding evaluation index
Fig. 3 The chemical space of the training set (blue), test set 1 (orange) and test set 2 (green) using PCA factorization
Fig. 4 (A) Importance matrix plot of the consensus model when the cutoff is 50%. (B) Number of occurrences of the top 20 features that affect all models when the cutoff is 50%
Description of the important features in the consensus model when the cutoff is 50%
| Descriptor category | Feature name | Absolute SHAP value | Description |
|---|---|---|---|
| Estate | SsOH | 0.0239 | Sum of sOH |
| TopoPSA | TopoPSA(NO) | 0.0178 | Topological polar surface area (use only nitrogen and oxygen) |
| Autocorrelation | ATSC0c | 0.0172 | Centered Moreau–Broto autocorrelation of lag 0 weighted by Gasteiger charge |
| Autocorrelation | GATS1Z | 0.0169 | Geary coefficient of lag 1 weighted by atomic number |
| Autocorrelation | ATS5pe | 0.0167 | Moreau–Broto autocorrelation of lag 5 weighted by Pauling EN |
| TopoPSA | TopoPSA | 0.0161 | Topological polar surface area |
| Autocorrelation | ATS5se | 0.0160 | Moreau–Broto autocorrelation of lag 5 weighted by Sanderson EN |
| Autocorrelation | ATS4p | 0.0150 | Moreau–Broto autocorrelation of lag 4 weighted by polarizability |
| Autocorrelation | ATS4i | 0.0144 | Moreau–Broto autocorrelation of lag 4 weighted by ionization potential |
| Autocorrelation | GATS1are | 0.0131 | Geary coefficient of lag 1 weighted by Allred–Rochow EN |
| Autocorrelation | GATS1pe | 0.0126 | Geary coefficient of lag 1 weighted by Pauling EN |
| MoeType | VSA_EState3 | 0.0122 | VSA EState Descriptor 3 (5.00 ≤ |
| Autocorrelation | ATS5are | 0.0122 | Moreau–Broto autocorrelation of lag 5 weighted by Allred–Rochow EN |
| Autocorrelation | GATS1m | 0.0122 | Geary coefficient of lag 1 weighted by mass |
| Autocorrelation | ATSC1are | 0.0115 | Centered Moreau–Broto autocorrelation of lag 1 weighted by Allred–Rochow EN |
| Autocorrelation | ATSC1c | 0.0112 | Centered Moreau–Broto autocorrelation of lag 1 weighted by Gasteiger charge |
| InformationContent | BIC0 | 0.0111 | 0-ordered bonding information content |
| Autocorrelation | GATS2Z | 0.0110 | Geary coefficient of lag 2 weighted by atomic number |
| AtomCount | nAtom | 0.0109 | Number of all atoms |
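A ranking like the one in this table is typically obtained by averaging the absolute SHAP value of each feature over all molecules. The sketch below assumes that aggregation. The record does not confirm it, the SHAP values are hypothetical, and computing them in the first place would require a SHAP implementation that is not shown here.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value over all samples.

    shap_rows holds one list of per-feature SHAP values per molecule.
    """
    n_samples = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)  # sign is dropped: only the magnitude of the effect counts
    scores = {name: total / n_samples for name, total in zip(feature_names, totals)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three molecules and three descriptors.
feature_names = ["SsOH", "TopoPSA", "nAtom"]
shap_rows = [
    [-0.03, 0.01, 0.00],
    [0.03, -0.02, 0.01],
    [-0.03, 0.02, -0.01],
]

for name, score in rank_by_mean_abs_shap(shap_rows, feature_names):
    print(f"{name}: {score:.4f}")
```

Because the sign is discarded, this ranking shows how strongly a descriptor influences the prediction but not in which direction.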
Fig. 5 SHAP dependence plot of the top 20 features of the consensus model when the cutoff is 50%