Vasanthanathan Poongavanam, Norbert Haider, Gerhard F Ecker.
Abstract
P-Glycoprotein (P-gp, ABCB1) plays a significant role in determining the ADMET properties of drugs and drug candidates. Substrates of P-gp are not only subject to multidrug resistance (MDR) in tumor therapy, they are also associated with poor pharmacokinetic profiles. In contrast, inhibitors of P-gp have been advocated as modulators of MDR. However, due to the polyspecificity of P-gp, knowledge of the molecular basis of ligand-transporter interaction is still poor, which makes it challenging to predict whether a compound is a P-gp substrate/non-substrate or an inhibitor/non-inhibitor. In the present investigation, we used a set of fingerprints representing the presence/absence of various functional groups for machine-learning-based classification of a set of 484 substrates/non-substrates and a set of 1935 inhibitors/non-inhibitors. The best models were obtained using a combination of a wrapper subset evaluator (WSE) with random forest (RF), k-nearest neighbor (kNN) and support vector machine (SVM), showing accuracies >70%. The best P-gp substrate models were further validated with three external sets of P-gp substrates: Drug Bank (n = 134), TP Search (n = 90) and a set compiled from the literature (n = 76). Association rule analysis was used to explore the structural feature requirements for P-gp substrates and inhibitors.
Year: 2012 PMID: 22595422 PMCID: PMC3445814 DOI: 10.1016/j.bmc.2012.03.045
Source DB: PubMed Journal: Bioorg Med Chem ISSN: 0968-0896 Impact factor: 3.641
Number of compounds in the training and test set for inhibitor and substrate models
| Models | Training set (P+) | Training set (P−) | Test set (P+) | Test set (P−) | Sum |
|---|---|---|---|---|---|
| Substrate | 142 | 140 | 101 | 101 | 484 |
| Inhibitor | 881 | 387 | 399 | 268 | 1935 |
P+: substrate or inhibitor.
P−: non-substrate or non-inhibitor.
Figure 1. Scoring plot of the first two principal components for substrates and non-substrates in the training and test set.
Figure 2. Frequency distribution of functional groups for substrate (A) and inhibitor (B) models. (For the inhibitor frequency plot, functional groups with a frequency <5% are not shown for clarity.)
Accuracies of the models for substrates and non-substrates using supervised classifiers
| Data set | Method | TP | FN | TN | FP | Sensitivity | Specificity | G-mean | MCC | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| 10-Fold | kNN | 188 | 55 | 167 | 74 | 0.77 | 0.69 | 0.73 | 0.47 | 0.73 |
| 10-Fold | SVM | 152 | 91 | 159 | 82 | 0.63 | 0.66 | 0.64 | 0.29 | 0.64 |
| 10-Fold | RF | | | | | | | | | |
| Test set | kNN | 75 | 26 | 60 | 41 | 0.74 | 0.59 | 0.66 | 0.34 | 0.67 |
| Test set | SVM | 67 | 43 | 57 | 44 | 0.61 | 0.56 | 0.59 | 0.17 | 0.59 |
The bold letters indicate the best performing model.
Abbreviations: kNN, k-nearest neighbor; SVM, support vector machine; RF, random forest; TP, true positive; FN, false negative; TN, true negative; FP, false positive; MCC, Matthews correlation coefficient.
The whole data set was used for 10-fold cross-validation.
Performance of the substrate prediction model for external test sets; A: TP Search, B: Drug Bank, C: Wang et al.
| Data set | Compounds | Sensitivity (SVM) | Sensitivity (kNN) | Sensitivity (RF) | Specificity (SVM) | Specificity (kNN) | Specificity (RF) | Overall accuracy (SVM) | Overall accuracy (kNN) | Overall accuracy (RF) |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 90 | 13 | 14 | | — | — | — | — | — | — |
| B | 134 | 30 | 64 | | — | — | — | — | — | — |
| C | 76 | 28 | 25 | | 91 | 91 | | 47 | 45 | |
The bold letters indicate the best performing model.
Only substrates are available.
Figure 3. Analysis of the external test sets: (a) scoring plot of the first two principal components, (b) scoring plot of the first two principal components for the external test sets (TP Search, Drug Bank, literature compounds), (c) comparison of functional group frequency (bins) for substrates between different data sources, (d) comparison of functional group frequency (bins) for non-substrates between different data sources.
Accuracies of the models for inhibitor/non-inhibitor using supervised classifiers
| Data set | Method | TP | FN | TN | FP | Sensitivity | Specificity | G-mean | MCC | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| 10-Fold | kNN | 1153 | 127 | 378 | 277 | 0.90 | 0.58 | 0.72 | 0.51 | 0.79 |
| 10-Fold | SVM | 1153 | 127 | 307 | 348 | 0.90 | 0.47 | 0.65 | 0.42 | 0.75 |
| Test set | kNN | 345 | 54 | 142 | 126 | 0.86 | 0.53 | 0.68 | 0.42 | 0.73 |
| Test set | SVM | 345 | 54 | 129 | 139 | 0.86 | 0.48 | 0.65 | 0.38 | 0.71 |
The bold letters indicate the best performing model.
Abbreviations: kNN, k-nearest neighbor; SVM, support vector machine; RF, random forest; TP, true positive; FN, false negative; TN, true negative; FP, false positive; MCC, Matthews correlation coefficient.
The whole data set was used for 10-fold cross-validation.
Figure 4. Selected P-glycoprotein inhibitors; atoms are marked according to association rules for inhibitors (ether: arrow; tertiary amine: dotted circle).
Figure 5. Schematic representation of the principle of the frequent pattern growth algorithm (FPGrowth).
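FPGrowth mines frequent itemsets, from which association rules are derived and ranked by support and confidence. The sketch below illustrates those two quantities with a brute-force enumeration; it is not FPGrowth itself, which reaches the same frequent itemsets more efficiently through a prefix tree. The transactions are invented toy data, and the helper names (`support`, `confidence`, `frequent_itemsets`) are illustrative assumptions; only the feature names ether and tertiary amine echo Figure 4.

```python
from itertools import combinations

# Hypothetical toy data: each set lists the functional groups present in one
# compound plus its class label. Invented for illustration, not from the paper.
transactions = [
    {"ether", "tertiary_amine", "inhibitor"},
    {"ether", "tertiary_amine", "inhibitor"},
    {"ether", "inhibitor"},
    {"tertiary_amine", "non_inhibitor"},
    {"carboxylic_acid", "non_inhibitor"},
    {"ether", "tertiary_amine", "non_inhibitor"},
]


def support(itemset, data):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in data) / len(data)


def confidence(antecedent, consequent, data):
    """Support of the full rule divided by the support of its antecedent."""
    return support(antecedent | consequent, data) / support(antecedent, data)


def frequent_itemsets(data, min_support):
    """Brute-force search for all itemsets with support >= min_support."""
    items = sorted(set().union(*data))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            s = support(frozenset(combo), data)
            if s >= min_support:
                result[frozenset(combo)] = s
                found = True
        if not found:
            # No frequent itemset of this size, so no larger one can exist.
            break
    return result


rule_lhs = frozenset({"ether", "tertiary_amine"})
rule_rhs = frozenset({"inhibitor"})
print("support:", support(rule_lhs | rule_rhs, transactions))
print("confidence:", confidence(rule_lhs, rule_rhs, transactions))
print("frequent itemsets:", len(frequent_itemsets(transactions, 0.5)))
```

On this toy set, the rule {ether, tertiary amine} → inhibitor holds in two of the three compounds that carry both groups. The numbers carry no chemical meaning and only show how a rule is scored.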