
A prediction model for oral bioavailability of drugs using physicochemical properties by support vector machine.

Rajnish Kumar, Anju Sharma, Pritish Kumar Varadwaj.

Abstract

OBJECTIVE: A computational model for predicting oral bioavailability is important both in the early stage of drug discovery, to select promising compounds for further optimization, and in the later stage, to identify candidates for clinical trials. In the present study, we propose a support vector machine (SVM)-based kernel learning approach applied to a set of 511 chemically diverse compounds with known oral bioavailability values.
MATERIAL AND METHODS: For each drug, 12 descriptors were calculated. The optimal hyper-plane parameters were selected using 384 training set compounds, and the prediction efficiency of the proposed classifier was tested on 127 test set compounds.
RESULTS: The overall prediction efficiency for the test set was 96.85%. Youden's index and the Matthews correlation coefficient were 0.929 and 0.909, respectively. The area under the receiver operating characteristic (ROC) curve was 0.943 with a standard error of 0.0253.
CONCLUSION: The prediction model suggests that, when chemoinformatics approaches are taken into account, SVM-based prediction of oral bioavailability can be an important tool for drug discovery and development at a preliminary level.


Keywords: Drugs; machine learning; oral bioavailability; prediction; support vector machine

Year: 2011 PMID: 22346230 PMCID: PMC3276008 DOI: 10.4103/0976-9668.92325

Source DB: PubMed Journal: J Nat Sci Biol Med ISSN: 0976-9668

INTRODUCTION

A drug intended for use in humans should have an ideal balance of pharmacokinetics and safety, as well as potency and selectivity. Human oral bioavailability is an important pharmacokinetic property[1] that describes the fraction of an administered drug that reaches the systemic circulation and its site of action, where it exerts its pharmacological and therapeutic effects. Bioavailability is 100% when a medication is administered intravenously, as it goes straight into the bloodstream. However, when a medication is administered via other routes (such as orally), its bioavailability decreases. Prediction of oral bioavailability is not an easy task, as bioavailability depends on the superposition of two processes: absorption and liver first-pass metabolism. Absorption in turn depends on the solubility and permeability of the compound, as well as on interactions with transporters and metabolizing enzymes in the gut wall. Permeability further depends on the size of the molecule, its capacity to form hydrogen bonds, its overall lipophilicity, and possibly its shape and flexibility. Molecular flexibility, for example, as evaluated by counting the number of rotatable bonds, has been identified as a factor influencing bioavailability.[2-4] The bioavailability of drugs from oral formulations is also influenced by many physiological factors, including gastrointestinal fluid composition, pH and dynamics, transit and motility, and transport. These factors may vary with age, gender, race, food, and disease.[5] Oral bioavailability is denoted by the letter F.
To lower the attrition rate of drug development, there is a need for robust and accurate computational methods that can predict and prioritize compounds before they are synthesized or moved toward preclinical and clinical development.[6] Various prediction models built on drugs with known oral bioavailability are reported in the literature, including statistical models,[7-15] mechanistic models,[16-21] QSAR/QSPR models,[22-28] genetic programming,[29-33] artificial neural networks, and machine learning classification.[34-36]

MATERIALS AND METHODS

Dataset

We collected oral bioavailability data from various literature studies.[4,15,37-41] The whole dataset comprises 1664 drugs. Redundancy was removed by manual screening, and the dataset selected for this study comprises 511 chemically diverse drugs. Drugs with oral bioavailability below 30% were regarded as low orally bioavailable drugs, and drugs with oral bioavailability of 30% or more were regarded as high orally bioavailable drugs.[15] Class labels were defined as "1" for high oral bioavailability and "0" for low oral bioavailability. The dataset of 511 drug molecules was then randomly split into a training set of 384 drugs and a test set of 127 drugs. The training set was used for training the various classifiers, while the test set was not exposed to the system during the descriptor selection, learning, kernel selection, and hyper-parameter selection phases.
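The labelling rule and the random split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the random seed, and the placeholder list of 511 bioavailability values are assumptions introduced here.

```python
import random

HIGH_THRESHOLD = 30.0  # percent; the cutoff stated in the text


def label_bioavailability(f_percent):
    """Return 1 (high) if F >= 30%, otherwise 0 (low)."""
    return 1 if f_percent >= HIGH_THRESHOLD else 0


def split_dataset(records, n_train, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder F values (percent) standing in for the 511 curated drugs.
rng = random.Random(42)
f_values = [rng.uniform(0.0, 100.0) for _ in range(511)]
labelled = [(f, label_bioavailability(f)) for f in f_values]

train, test = split_dataset(labelled, n_train=384)
print(len(train), len(test))  # 384 127
```

Fixing the seed makes the split reproducible, which matters here because the test set must stay untouched throughout descriptor and hyper-parameter selection.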

Descriptor selection

In a classification problem, the data to be classified are usually associated with a large number of features or descriptors. The result is a high-dimensional feature space, which makes classification more difficult, so the first step is to reduce the number of dimensions. Feature or descriptor selection is the process of identifying and removing as much irrelevant and redundant information as possible, which often improves the performance of machine learning algorithms. Twelve optimal descriptors were selected using the sequential forward feature selection (SFFS) algorithm.[42] The SFFS algorithm starts with an empty set of features. In the first iteration, the algorithm considers all feature subsets containing only one feature. The feature subset with the highest accuracy is used as the basis for the next iteration. In each subsequent iteration, the algorithm adds to the basis each feature that was not previously selected and retains the feature subset that gives the highest estimated performance. The search terminates when the accuracy of the current subset cannot be improved by adding any other feature. SFFS is stated as: given a feature set $X = \{x_i \mid i = 1, \ldots, N\}$, find a subset $Y_M = \{x_{i_1}, x_{i_2}, \ldots, x_{i_M}\}$, with $M < N$, that optimizes an objective function $J(Y)$.
The set of optimal descriptors includes molecular mass (MA), molecular surface area (MSA), molecular volume (MV), molecular refractivity (MR), total hydrogen count (HC), partition coefficient (logP), rotatable bonds (RTB), total polar surface area (TPSA), solubility index (logS), shape flexibility index (SFI), sum of E-state indices (SESI), and count of hydroxyl groups (HYG). The feature values in the dataset fall in different ranges, so to avoid this discrepancy we scaled the numeric values to lie between -1 and 1. Such scaling gives a better representation of the feature values in the kernel function and also avoids numerical difficulties during the calculation.
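The greedy search and the scaling to the range -1 to 1 described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: `forward_select`, `scale_to_range`, and the toy scoring function are hypothetical names, and in practice the `score` callback would return something like a cross-validated accuracy for the candidate subset.

```python
def scale_to_range(column, lo=-1.0, hi=1.0):
    """Linearly rescale one descriptor column to the interval [lo, hi]."""
    c_min, c_max = min(column), max(column)
    if c_max == c_min:  # constant column: map every value to the midpoint
        return [(lo + hi) / 2.0 for _ in column]
    span = c_max - c_min
    return [lo + (hi - lo) * (v - c_min) / span for v in column]


def forward_select(n_features, score):
    """Greedy forward selection over feature indices 0..n_features-1.

    `score(subset)` is supplied by the caller and returns the estimated
    performance of a tuple of feature indices (higher is better).
    """
    selected, best = [], float("-inf")
    remaining = set(range(n_features))
    while remaining:
        # Try adding each unused feature to the current basis.
        trial_score, trial_feature = max(
            (score(tuple(selected + [f])), f) for f in sorted(remaining)
        )
        if trial_score <= best:  # no candidate improves the score: stop
            break
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best = trial_score
    return selected, best


def toy_score(subset):
    """Toy objective: features 0 and 2 help, every feature has a small cost."""
    useful = {0: 0.3, 2: 0.2}
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


selected, best = forward_select(4, toy_score)
print(selected)                         # [0, 2]
print(scale_to_range([0.0, 5.0, 10.0]))  # [-1.0, 0.0, 1.0]
```

The search stops as soon as no single added feature raises the score, which mirrors the termination rule given in the text.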

Support vector machine description

In this process, the input vector for the training set as well as the test set was quantified as $X_i = (X_{i1}, X_{i2}, \ldots, X_{i12})$, each labeled with the corresponding y = 1 or y = 0 depending on whether it represents a high orally bioavailable drug or a low orally bioavailable drug, respectively. The training set was then subjected to the support vector machine (SVM) classifier, which involved fixing several hyper-parameters that determine the function optimized by the SVM. This choice is crucial and has a profound impact on the performance of the trained classifier. We initially tried several kernels (linear, polynomial, and radial basis function (RBF)) to determine which of them was applicable to our data and able to classify it efficiently.[43] We found the RBF kernel to be the most suitable (the number of features was not very large in comparison to the size of the dataset); with it, training errors on low oral bioavailability data (false positives) outweighed errors on high oral bioavailability data (false negatives).
$$K(X_i, X_j) = \exp\left(-\gamma \lVert X_i - X_j \rVert^2\right), \quad \gamma > 0 \qquad (1)$$
Kernel (1) is best suited to data in which the class-conditional probability distribution function approaches the Gaussian distribution. It maps the non-linear data into a higher-dimensional space where the data are linearly separable. Its exponential can be expanded into an infinite series, giving rise to an infinite-dimensional polynomial kernel. However, this kernel is somewhat difficult to tune, in the sense that it is difficult to arrive at an optimum γ and to choose the corresponding C that works best for a given problem.
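Equation (1) can be evaluated directly, as in the short sketch below. The function name and the two example vectors are hypothetical; the value of γ is the one reported later in the paper.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


# Two hypothetical 12-descriptor vectors, already scaled to [-1, 1].
a = [0.0] * 12
b = [0.5] * 12

print(rbf_kernel(a, a, gamma=0.0078125))  # identical vectors give 1.0
print(rbf_kernel(a, b, gamma=0.0078125))  # exp(-0.0078125 * 3.0)
```

A small γ such as this one makes the kernel decay slowly with distance, so even moderately distant compounds still influence one another's classification.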
The difficulty of choosing γ and C was addressed by running a grid parameter search exploring all combinations of C and γ within each cross-validation routine, where γ ranged from $2^{-15}$ to $2^{4}$ and C ranged from $2^{-5}$ to $2^{15}$.[44] To identify an optimal hyper-parameter set, we performed a two-step grid search on C and γ with 10-fold cross-validation, dividing the training set into 10 subsets of equal size (~38 drugs, each having 12 descriptors). Each subset in turn is tested using the classifier trained on the remaining nine subsets. Pairs of (C, γ) were tried and the one with the best cross-validation accuracy was picked. Using the RBF kernel, the best cross-validation accuracy was obtained at γ = 0.0078125 and C = 512. The result showed a good classification accuracy of 88.54% during cross-validation. The methodology adopted for model generation is illustrated in Figure 1.
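The grid search with 10-fold cross-validation can be outlined as below. This is a sketch under assumptions, not the authors' procedure: it takes every integer exponent in the stated ranges (the paper does not give the step size of its two-step search), assumes the training data are already shuffled so contiguous folds are acceptable, and leaves the SVM training itself to a caller-supplied `cv_accuracy` function. All names are hypothetical.

```python
def power_of_two_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs with C = 2**a and gamma = 2**b."""
    return [(2.0 ** a, 2.0 ** b) for a in c_exponents for b in gamma_exponents]


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search(grid, folds, cv_accuracy):
    """Return the (C, gamma) pair with the best mean fold accuracy.

    `cv_accuracy(C, gamma, train_idx, test_idx)` is supplied by the caller;
    it should train on train_idx and return the accuracy on test_idx.
    """
    best_pair, best_score = None, float("-inf")
    for c, gamma in grid:
        scores = []
        for i, test_idx in enumerate(folds):
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            scores.append(cv_accuracy(c, gamma, train_idx, test_idx))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_pair, best_score = (c, gamma), mean_score
    return best_pair, best_score


grid = power_of_two_grid(range(-5, 16), range(-15, 5))
folds = k_fold_indices(384, 10)
print(len(grid), [len(f) for f in folds])
# 420 [39, 39, 39, 39, 38, 38, 38, 38, 38, 38]
```

The reported optimum lies on this grid, since C = 512 is 2**9 and γ = 0.0078125 is 2**-7. Because 384 is not divisible by 10, four folds hold 39 drugs and six hold 38, consistent with the "~38" in the text.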

Figure 1

Stepwise illustration of generation of an oral bioavailability prediction model and its application

RESULTS

To optimize the SVM parameters γ and C, 10-fold cross-validation was applied to the training dataset, exploring various combinations of C ($2^{-5}$ to $2^{15}$) and γ ($2^{-15}$ to $2^{4}$). In 10-fold cross-validation, the training dataset (384 drugs, each having 12 descriptors) was split into 10 subsets of equal size, where one subset was used as the test dataset while the other subsets were used for training the classifier. The process was repeated 10 times, each time with a different subset held out for testing, ensuring that all subsets were used for both training and testing. A two-step grid optimization was carried out, and the result [Figure 2] shows that the optimized C and γ were 512 and 0.0078125, respectively.

Figure 2

Contour plot of grid search result showing optimum values of hyper-parameter

The best combination of γ and C obtained from the grid-based optimization was used for training an RBF-based SVM classifier on the entire training data (384 drugs, each having 12 descriptors). The result showed a good classification accuracy of 88.54% during cross-validation. The accuracy reported on the training dataset indicates the effectiveness and reliability of this prediction method, but it may or may not give equivalent or better accuracy when applied to novel drugs, i.e., drugs with an unknown oral bioavailability profile. Therefore, it is important to test the SVM classifier on a non-cross-validated test set that is out-of-sample and independent of the training set data. When we applied the SVM classifier to the whole test set (127 drugs, each having 12 descriptors), the classifier achieved an accuracy of 96.85% using the RBF kernel with γ = 0.0078125 and C = 512. This prediction accuracy suggests that SVM-based prediction of oral bioavailability can be considered a helpful tool in drug discovery and development. The efficiency of the classifier was further evaluated with the help of the following quantities: (a) true positives (TP), the total number of correctly classified high orally bioavailable drugs; (b) true negatives (TN), the total number of correctly classified low orally bioavailable drugs; (c) false positives (FP), the total number of incorrectly classified low orally bioavailable drugs; and (d) false negatives (FN), the total number of incorrectly classified high orally bioavailable drugs. Using these quantities, several statistical metrics were calculated to measure the effectiveness of the oral bioavailability SVM classifier.
The sensitivity (Sn) and specificity (Sp) metrics, which indicate the ability of a prediction system to classify the high and low orally bioavailable drugs, respectively, were calculated by equations (2) and (3), and the receiver operating characteristic (ROC) curve for the classifier was plotted [Figure 3].

Figure 3

Receiver operating characteristic (ROC) plot for a classifier with optimized values of C and γ

$$\mathrm{Sn}\,(\%) = \frac{TP}{TP + FN} \times 100 \qquad (2)$$
$$\mathrm{Sp}\,(\%) = \frac{TN}{TN + FP} \times 100 \qquad (3)$$
To indicate the overall performance of the classifier system, the accuracy (Ac), i.e., the percentage of correctly classified drugs, and the Matthews correlation coefficient (MCC) were computed as follows:
$$\mathrm{Ac} = \frac{TP + TN}{TP + FP + TN + FN} \times 100 \qquad (4)$$
$$\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TN + FP)(TN + FN)(TP + FP)(TP + FN)}} \qquad (5)$$
Sensitivity (Sn) was 95.60% with a false positive (FP) proportion of 0.79%, whereas specificity (Sp) was 97.30% with a false negative (FN) proportion of 3.15%. Similarly, Youden's index (Youden's index = sensitivity + specificity – 1) was 0.929 and the Matthews correlation coefficient (MCC) was 0.909. The overall accuracy (Ac) calculated using equation (4) was 96.1%, which is higher than that of the existing methods compared in Table 1. The area under the ROC curve was 0.943 with a standard error of 0.0253.
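The metrics defined above (sensitivity as TP/(TP+FN), specificity as TN/(TN+FP), accuracy, Youden's index, and MCC) can be computed from the four counts as in the sketch below. The paper does not list its confusion matrix in this section, so the counts used in the example are hypothetical: they were chosen only because they round to the reported Sn, Sp, Youden's index, and MCC, and they total 128 compounds rather than the 127 in the test set, so they should be read as an illustration and not as the authors' data.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Ac (in percent), Youden's index, and MCC from counts."""
    sn = tp / (tp + fn) * 100.0
    sp = tn / (tn + fp) * 100.0
    ac = (tp + tn) / (tp + fp + tn + fn) * 100.0
    youden = sn / 100.0 + sp / 100.0 - 1.0
    denom = math.sqrt((tn + fp) * (tn + fn) * (tp + fp) * (tp + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Ac": ac, "Youden": youden, "MCC": mcc}


# Hypothetical counts for illustration only; NOT reported in the paper.
m = classification_metrics(tp=87, tn=36, fp=1, fn=4)
print({name: round(value, 3) for name, value in m.items()})
```

MCC uses all four cells of the confusion matrix, which makes it a more balanced summary than accuracy when the high and low classes differ in size.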

DISCUSSION

The prediction model derived from the SVM can serve as a primary tool for obtaining an initial estimate of the oral bioavailability of ligands. The user only needs to calculate the 12 physicochemical descriptors, as these values are a prerequisite for predicting oral bioavailability with the generated SVM model [Figure 1]. A ligand with unknown oral bioavailability can then be tested against the prediction model. At a preliminary level, the model predicts whether the oral bioavailability of the ligand under study is low or high. Numerous attempts have been made in the past to predict the oral bioavailability of drugs and ligands by computational and experimental methods. Some of those prediction models are listed in Table 1 along with the current study, and the model generated by the SVM appears to be more satisfactory in terms of prediction accuracy.

Table 1

Comparative study of some of the oral bioavailability prediction models with current study

Absorption of an orally administered drug is a complex process and, although related to the physicochemical properties of the drug, it is related to them in fairly complex ways. Physiological and environmental conditions, such as the presence or absence of food and the residence time of the drug in contact with the small intestinal epithelium, influence the bioavailability of drugs and make the prediction of absorption even more complex.[45] Failure to appreciate this complexity when building models may lead to models with low confidence. An alternative approach to modeling oral bioavailability is to develop structure-based models for the properties contributing to the absorption process, such as solubility and permeability (included in the presented model as logS and logP). These can then be used to identify opportunities for optimization. For example, if a potential drug is expected to have poor oral bioavailability due to low intrinsic aqueous solubility, then this is a property amenable to manipulation by the formulation scientist. On the other hand, if the compound is both poorly soluble and poorly permeable, and also has a significant metabolic liability, optimization may be very difficult if not impossible. Such candidates present high risks to successful development and should be identified as such early in the drug identification and development process. Judicious development and use of computational models will clearly aid in these processes.[46,47]

CONCLUSION

The SVM classifier with a radial basis function kernel, with γ = 0.0078125 and C = 512, was applied to the test dataset. The overall accuracy of the model obtained is 96.85%. This suggests that, when chemoinformatics approaches are taken into account, SVM-based prediction of oral bioavailability can be an important tool for drug discovery and development at a preliminary level.

34 in total

Review 1. Computational methods to estimate drug development parameters.

Authors: B L Podlogar; I Muegge; L J Brice
Journal: Curr Opin Drug Discov Devel Date: 2001-01

Review 2. ADMET in silico modelling: towards prediction paradise?

Authors: Han van de Waterbeemd; Eric Gifford
Journal: Nat Rev Drug Discov Date: 2003-03 Impact factor: 84.694

3. Characteristic physical properties and structural fragments of marketed oral drugs.

Authors: Michal Vieth; Miles G Siegel; Richard E Higgs; Ian A Watson; Daniel H Robertson; Kenneth A Savin; Gregory L Durst; Philip A Hipskind
Journal: J Med Chem Date: 2004-01-01 Impact factor: 7.446

4. The prediction of drug metabolism, tissue distribution, and bioavailability of 50 structurally diverse compounds in rat using mechanism-based absorption, distribution, and metabolism prediction tools.

Authors: Stefan S De Buck; Vikash K Sinha; Luca A Fenu; Ron A Gilissen; Claire E Mackie; Marjoleen J Nijsen
Journal: Drug Metab Dispos Date: 2007-01-31 Impact factor: 3.922

5. Prediction of absolute bioavailability for drugs using oral and renal clearance following a single oral dose: a critical view.

Authors: I Mahmood
Journal: Biopharm Drug Dispos Date: 1997-08 Impact factor: 1.627

6. Prediction of oral drug absorption in humans by theoretical passive absorption model.

Authors: Kouki Obata; Kiyohiko Sugano; Ryoichi Saitoh; Atsuko Higashida; Yoshiaki Nabuchi; Minoru Machida; Yosinori Aso
Journal: Int J Pharm Date: 2005-04-11 Impact factor: 5.875

7. Prediction of bioavailability for drugs with a high first-pass effect using oral clearance data.

Authors: A Somogyi; M Eichelbaum; R Gugler
Journal: Eur J Clin Pharmacol Date: 1982 Impact factor: 2.953

8. Predicting human oral bioavailability of a compound: development of a novel quantitative structure-bioavailability relationship.

Authors: C W Andrews; L Bennett; L X Yu
Journal: Pharm Res Date: 2000-06 Impact factor: 4.200

9. ADME evaluation in drug discovery. 6. Can oral bioavailability in humans be effectively predicted by simple molecular property-based rules?

Authors: Tingjun Hou; Junmei Wang; Wei Zhang; Xiaojie Xu
Journal: J Chem Inf Model Date: 2007 Mar-Apr Impact factor: 4.956

10. A physiological model for the estimation of the fraction dose absorbed in humans.

Authors: Stefan Willmann; Walter Schmitt; Jörg Keldenich; Jörg Lippert; Jennifer B Dressman
Journal: J Med Chem Date: 2004-07-29 Impact factor: 7.446
