Alex A Freitas, Kriti Limbu, Taravat Ghafourian.
Abstract
BACKGROUND: Volume of distribution is an important pharmacokinetic property that indicates the extent of a drug's distribution in the body tissues. This paper addresses the problem of estimating the apparent volume of distribution at steady state (Vss) of chemical compounds in the human body using decision tree-based regression methods from data mining (machine learning), and discusses the pros and cons of several types of such methods. The regression methods predict Vss using, as predictive features, both the compounds' molecular descriptors and their tissue:plasma partition coefficients (Kt:p), which are often used in physiologically-based pharmacokinetics. This work therefore assesses whether the data mining-based prediction of Vss can be made more accurate by using as input not only the compounds' molecular descriptors but also (a subset of) their predicted Kt:p values.
Keywords: ADME; Data mining; Decision tree; Machine learning; Pharmacokinetics; QSAR; QSPkR; Tissue partition; Volume of distribution
Year: 2015 PMID: 25767566 PMCID: PMC4356883 DOI: 10.1186/s13321-015-0054-x
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Graphical summary of our two-phase approach for predicting log Vss. First, 13 log Kt:p values (one for each tissue) are the target variables to be predicted from molecular descriptors. Then, the predicted log Kt:p values from these models, together with molecular descriptors, are used as descriptors to build models predicting log Vss.
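The two-phase scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a one-split regression stump (hypothetical helper `fit_stump`) stands in for the decision tree-based learners, the function name `two_phase_fit` is invented for this sketch, and all descriptor values, tissue names, and targets are made-up toy data.

```python
from statistics import mean

def fit_stump(X, y):
    """Fit a one-split regression tree (a stand-in, NOT M5P or REPTree).

    Tries every feature/threshold pair, keeps the split with the lowest
    sum of squared errors, and returns a predict(row) function.
    """
    best = None
    for j in range(len(X[0])):
        # Skip the largest value so the right-hand side is never empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # every feature is constant: predict the global mean
        m = mean(y)
        return lambda row: m
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr

def two_phase_fit(ktp_X, log_ktp, vss_X, log_vss):
    """Phase 1: one model per tissue predicts log Kt:p from descriptors.
    Phase 2: predicted log Kt:p values plus descriptors predict log Vss."""
    tissue_models = {tissue: fit_stump(ktp_X, values)
                     for tissue, values in log_ktp.items()}

    def augment(row):
        # Append the phase-1 predictions to the molecular descriptors.
        return list(row) + [model(row) for model in tissue_models.values()]

    vss_model = fit_stump([augment(r) for r in vss_X], log_vss)
    return lambda row: vss_model(augment(row))

# Toy data (invented): two descriptors per compound, two tissues.
ktp_X = [[0.1, 1.0], [0.4, 2.0], [0.8, 3.0], [1.2, 4.0]]
ktp_y = {"muscle": [-0.2, -0.1, 0.3, 0.5], "adipose": [-0.5, 0.0, 0.6, 0.9]}
vss_X = [[0.2, 1.5], [0.5, 2.5], [0.9, 3.5], [1.1, 3.8]]
vss_y = [-0.3, -0.1, 0.4, 0.6]

predict = two_phase_fit(ktp_X, ktp_y, vss_X, vss_y)
print(predict([1.0, 3.6]))
```

The Kt:p models and the Vss model are fitted on separate compound lists here because the source reports different dataset sizes for the two phases (110 and 402 compounds).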
Mean Absolute Error (calculated by 10-fold cross-validation) in the prediction of log Kt:p values by different decision tree-based regression methods, for each tissue, in the Kt:p-target dataset (with 110 compounds)
| Tissue | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| muscle | 0.3029 | 0.3040 | 0.3164 | | 0.3384 | 0.3857 | 0.3172 | 0.3228 |
| brain | 0.4695 | 0.4797 | 0.4869 | 0.4768 | 0.5387 | 0.5347 | 0.5170 | |
| intestine | 0.5939 | 0.4700 | 0.5337 | 0.5245 | | 0.4698 | 0.6060 | 0.5361 |
| lung | | 0.4812 | 0.4755 | 0.4776 | 0.5710 | 0.5522 | 0.5114 | 0.4822 |
| spleen | 0.9824 | 1.0069 | 1.0276 | 1.0206 | 0.8149 | 0.7997 | 0.8783 | |
| heart | 0.3391 | 0.3076 | | 0.3321 | 0.3562 | 0.4116 | 0.3723 | 0.3358 |
| skin | 0.2844 | 0.2752 | 0.2954 | | 0.3436 | 0.3433 | 0.2782 | 0.2961 |
| bone | 0.6210 | 0.7101 | 0.5893 | 0.5735 | 0.5109 | 0.5847 | 0.5724 | |
| adipose | 0.3581 | 0.4106 | 0.4186 | 0.4005 | 0.4557 | 0.5268 | 0.4031 | |
| kidneys | 0.2848 | 0.2706 | 0.2756 | 0.2736 | 0.3226 | 0.4035 | 0.2913 | |
| liver | 0.5495 | 0.5634 | 0.5036 | 0.5091 | 0.4925 | 0.5569 | 0.5879 | |
| gut | 0.6484 | 0.6134 | 0.6056 | 0.4905 | 0.4695 | 0.3912 | 0.4135 | |
| thymus | 0.3943 | 0.3599 | 0.3667 | 0.3601 | 0.2633 | | 0.3529 | 0.2623 |
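The errors in the table above are 10-fold cross-validated mean absolute errors. A minimal sketch of that protocol follows, assuming a simple shuffled split into folds and errors pooled over all held-out compounds (the source does not say how folds were formed or averaged). The helper names `k_fold_mae` and `fit_mean` are hypothetical, the data are invented, and a mean predictor stands in for the tree-based learners.

```python
import random
from statistics import mean

def fit_mean(X, y):
    """Baseline learner (a stand-in for a tree-based regressor):
    always predicts the mean of the training targets."""
    m = mean(y)
    return lambda row: m

def k_fold_mae(X, y, fit, k=10, seed=0):
    """Mean absolute error over all held-out predictions of k-fold CV."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    abs_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in idx if i not in held_out]
        # Fit on the other k-1 folds, then score the held-out fold.
        model = fit([X[i] for i in train], [y[i] for i in train])
        abs_errors += [abs(model(X[i]) - y[i]) for i in test_idx]
    return mean(abs_errors)

# Toy data (invented): 110 compounds, matching the size of the Kt:p-target dataset.
X = [[float(i)] for i in range(110)]
y = [0.01 * i for i in range(110)]  # made-up log Kt:p values
mae = k_fold_mae(X, y, fit_mean)
print(f"10-fold CV MAE: {mae:.4f}")
```

Because every compound is held out exactly once, the returned value is an average over all 110 predictions, each made by a model that never saw that compound.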
Mean Absolute Error (MAE), calculated by 10-fold cross-validation, in the prediction of log Vss by each regression method, in the model selection dataset (with 402 compounds)
| Regression method | | | |
|---|---|---|---|
| M5P-4 | 0.3891 | 0.3698 | 0.3739 |
| M5P-6 | 0.3823 | 0.3665 | 0.4003 |
| M5P-8 | 0.4715 | 0.3616 | 0.3847 |
| M5P-10 | 0.4751 | 0.3658 | 0.3763 |
| M5P-RegTree | | 0.3772 | 0.3782 |
| REPTree | 0.3824 | 0.4330 | 0.3974 |
| M5Rules | 0.3911 | 0.3843 | 0.3836 |
| Bagging (M5P) | 0.3713 | | |
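Bagging, the last method listed, averages the predictions of several models, each trained on a bootstrap resample of the training set. The sketch below illustrates the idea under stated assumptions: the source uses M5P model trees as the base learner (10 per ensemble), whereas here a one-split regression stump (hypothetical helper `fit_stump`) stands in for M5P, `fit_bagging` is an invented name, and the data are made up.

```python
import random
from statistics import mean

def fit_stump(X, y):
    """One-split regression tree (a stand-in, NOT M5P)."""
    best = None
    for j in range(len(X[0])):
        # Skip the largest value so the right-hand side is never empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # every feature is constant: predict the global mean
        m = mean(y)
        return lambda row: m
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr

def fit_bagging(X, y, fit_base, n_models=10, seed=0):
    """Train n_models base learners on bootstrap resamples and average them."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # n rows drawn with replacement
        models.append(fit_base([X[i] for i in sample], [y[i] for i in sample]))
    return lambda row: mean([m(row) for m in models])

# Toy data (invented): one descriptor, a step-shaped target.
X = [[i / 10] for i in range(20)]
y = [0.0] * 10 + [1.0] * 10
predict = fit_bagging(X, y, fit_stump)
print(predict([0.2]), predict([1.8]))
```

Averaging over resampled models mainly reduces the variance of unstable learners such as trees, at the cost of a model that is harder to read than a single tree.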
Mean Absolute Error (MAE) and Geometric Mean Fold Error (GMFE) calculated for each combination of input descriptor set and the best regression model for that descriptor set, when predicting log Vss for all compounds in the external set (with 202 compounds)
| # | Input descriptor set | Best regression model | MAE | GMFE |
|---|---|---|---|---|
| 1 | predicted log Kt:p’s and molecular descriptors | M5P – regression tree | 0.4172 | 2.61 |
| 2 | molecular descriptors only (i.e., without predicted log Kt:p’s) | Bagging (a set of M5P model trees) | 0.3676 | 2.33 |
| 3 | descriptor set selected by genetic search-based CFS | Bagging (a set of M5P model trees) | 0.3609 | 2.29 |
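The two error measures in this table are closely linked. Assuming the common definition GMFE = 10^(mean |log10 predicted − log10 observed|), which the text here does not spell out, the GMFE is simply 10 raised to the MAE of the log Vss predictions. The short check below applies that assumed relation to the three reported pairs; the function name `gmfe_from_log_errors` is hypothetical.

```python
def gmfe_from_log_errors(log_pred, log_obs):
    """Geometric mean fold error for predictions made on a log10 scale
    (assumed definition: 10 ** mean absolute log10 error)."""
    mae = sum(abs(p - o) for p, o in zip(log_pred, log_obs)) / len(log_obs)
    return 10 ** mae

# Reported (MAE, GMFE) pairs from the table above.
reported = [(0.4172, 2.61), (0.3676, 2.33), (0.3609, 2.29)]
for mae, gmfe in reported:
    print(f"MAE={mae:.4f}  10**MAE={10 ** mae:.3f}  reported GMFE={gmfe}")
```

Under this assumption the computed values agree with the reported GMFE to within 0.01, so a GMFE of 2.29 can be read as predictions lying, on average, within about 2.3-fold of the observed Vss.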
Regression rules with coverage ≥ 10 extracted from the regression tree built by M5P when using as input descriptors both the predicted log Kt:p for 12 tissues and a large set of molecular descriptors
| Rule | | | |
|---|---|---|---|
| 1 | log Kskin:plasma ≤ 0.044 | 0.57 | 80 |
| 2 | log Kskin:plasma ≤ 0.044 | 0.37 | 30 |
| 3 | log Kskin:plasma ≤ 0.044 | 0.27 | 60 |
| 4 | log Kskin:plasma > 0.044 | 1.66 | 13 |
| 5 | log Kskin:plasma > 0.044 | 1.96 | 26 |
| 6 | log Kskin:plasma > 0.044 | 3.14 | 20 |
| 7 | log Kskin:plasma > 0.044 | 2.15 | 19 |
| 8 | log Kskin:plasma > 0.044 | 1.10 | 100 |
Most relevant descriptors occurring in the set of 10 model trees produced by Bagging M5P to predict log Vss when using as input only the descriptors selected by genetic search-based CFS
| Descriptor | | |
|---|---|---|
| log Kadipose_tissue:plasma | 10 | 0 |
| PEOE_VSA+0 | 0 | 3 |
| Log P | 0 | 2 |
| TPSA | 0 | 2 |
| C_ratio | 0 | 2 |
| Vsurf_ID6 | 0 | 2 |
Figure 2. Observed vs predicted log Vss for the external validation set using the model built by Bagging M5P from the descriptors selected by CFS; outliers have been identified by empty circles.