Dejun Jiang, Tailong Lei, Zhe Wang, Chao Shen, Dongsheng Cao, Tingjun Hou.
Abstract
Breast cancer resistance protein (BCRP/ABCG2), an ATP-binding cassette (ABC) efflux transporter, plays a critical role in multi-drug resistance (MDR) to anti-cancer drugs and in drug-drug interactions. Predicting BCRP inhibition can facilitate the evaluation of potential drug resistance and drug-drug interactions in the early stages of drug discovery. Here we report a structurally diverse dataset consisting of 1098 BCRP inhibitors and 1701 non-inhibitors. Analysis of various physicochemical properties shows that BCRP inhibitors are more hydrophobic and aromatic than non-inhibitors. We then developed a series of quantitative structure-activity relationship (QSAR) models to discriminate between BCRP inhibitors and non-inhibitors. The optimal feature subset was determined by a wrapper feature selection method named rfSA (a simulated annealing algorithm coupled with random forest), and the classification models were built on this subset using seven machine learning approaches: one deep learning method, two ensemble learning methods, and four classical machine learning methods. The statistical results demonstrated that three methods, support vector machine (SVM), deep neural networks (DNN) and extreme gradient boosting (XGBoost), outperformed the others, and that the SVM classifier yielded the best predictions (MCC = 0.812 and AUC = 0.958 for the test set). A perturbation-based model-agnostic method was then used to interpret the models and to analyze the representative features of each model. The application domain analysis demonstrated the prediction reliability of our models. Moreover, the important structural fragments related to BCRP inhibition were identified by the information gain (IG) method along with frequency analysis.
In conclusion, we believe that the classification models developed in this study can be regarded as simple and accurate tools to distinguish BCRP inhibitors from non-inhibitors in drug design and discovery pipelines.
Keywords: ADMET; Breast cancer resistance protein; Deep learning; Ensemble learning; Extreme gradient boosting; Machine learning; Multi-drug resistance
Year: 2020 PMID: 33430990 PMCID: PMC7059329 DOI: 10.1186/s13321-020-00421-y
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 The workflow of QSAR modeling
Fig. 2 The chemical diversity of the final dataset evaluated by the Tanimoto similarities between any two of the compounds. The Tanimoto index was calculated based on the 881 PubChem fingerprints
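As a reminder of what the Tanimoto index in Fig. 2 measures, here is a minimal sketch. It assumes each fingerprint is represented as the set of positions of its on-bits; the function name `tanimoto` and the toy bit sets are illustrative and do not come from the source.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Toy fingerprints (hypothetical): two shared bits out of four distinct bits.
fp1 = {1, 2, 3}
fp2 = {2, 3, 4}
print(tanimoto(fp1, fp2))
```

Applying such a function to every pair of compounds gives a pairwise similarity distribution of the kind a diversity plot summarizes.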
Fig. 3 Distributions of the eight simple molecular descriptors, including SlogP, logS, a_acc, a_don, b_rotN, KierFlex, a_aro, and b_aro for the BCRP inhibitors and non-inhibitors
The mean values of eight simple descriptors for inhibitors and non-inhibitors and the associated p-values determined by Mann–Whitney U-test for each descriptor
| Class | SlogP | logS | a_acc | a_don | b_rotN | KierFlex | a_aro | b_aro |
|---|---|---|---|---|---|---|---|---|
| Inhibitors | 4.26 | −6.07 | 4.10 | 1.19 | 6.09 | 4.36 | 17.51 | 18.01 |
| Non-inhibitors | 3.29 | −5.25 | 3.70 | 1.41 | 5.88 | 4.53 | 13.46 | 13.73 |
| p-value | 2.16 × 10^−50 | 4.63 × 10^−27 | 1.32 × 10^−09 | 1.12 × 10^−08 | 6.38 × 10^−01 | 3.14 × 10^−04 | 6.90 × 10^−75 | 5.90 × 10^−80 |
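The p-values in the table come from the Mann–Whitney U-test. The sketch below shows how such a test can be computed from ranks. It uses the normal approximation without tie correction, which is only reasonable for large samples. The function name `mann_whitney_u` and the toy samples are illustrative assumptions, and the source does not say which software the authors used.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value).

    Normal approximation, no tie correction: a rough sketch only.
    """
    pooled = sorted((value, group) for group, values in enumerate((x, y)) for value in values)
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        # Find the run of tied values pooled[i:j] and give each the average rank.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value


# Toy descriptor values (hypothetical), not data from the study.
u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u, p)
```

With roughly a thousand compounds per class, as in this dataset, the normal approximation is usually adequate; for small samples an exact test would be preferable.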
Fig. 4 The feature selection process of molecular features through the rfSA method. Internal: four-fifths of the training set in the fivefold cross-validation with five repetitions; External: one-fifth of the training set in the fivefold cross-validation with five repetitions
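The idea behind wrapper feature selection by simulated annealing can be sketched as follows. This is not the authors' rfSA implementation. The `score` callable stands in for the cross-validated random forest performance that a method like rfSA would evaluate, and `toy_score`, `INFORMATIVE`, the cooling schedule, and all parameter defaults are hypothetical.

```python
import math
import random


def sa_feature_selection(n_features, score, n_iter=300, t0=1.0, cooling=0.98, seed=0):
    """Search for a feature subset that maximises score(subset) by simulated annealing."""
    rng = random.Random(seed)
    current = {i for i in range(n_features) if rng.random() < 0.5}
    current_score = score(current)
    best, best_score = set(current), current_score
    temperature = t0
    for _ in range(n_iter):
        # Neighbour move: flip the membership of one randomly chosen feature.
        candidate = current ^ {rng.randrange(n_features)}
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse subsets with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = set(current), current_score
        temperature *= cooling
    return best, best_score


# Hypothetical stand-in for model performance: reward three "informative"
# features and penalise every additional feature slightly.
INFORMATIVE = {0, 3, 7}


def toy_score(subset):
    return len(subset & INFORMATIVE) - 0.1 * len(subset - INFORMATIVE)


best, best_score = sa_feature_selection(10, toy_score)
print(sorted(best), best_score)
```

In a real wrapper, `score` would train and validate a model on the candidate subset, which is why the internal/external split described in the caption matters: the score that drives the search must be kept separate from the score used to judge the result.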
Fig. 5 The chemical space distributions based on principal component analysis (PCA): a the distributions of PC1 and PC2; b the distributions of PC1 and PC3; c the distributions of PC2 and PC3; d the chemical space defined by molecular weight (X-axis) and SlogP (Y-axis)
Statistical results of the seven classification models based on 144 molecular features for the training (random fivefold cross-validation) and test sets
| Model | Training GA | Training BA | Training MCC | Training AUC | Test GA | Test BA | Test MCC | Test AUC |
|---|---|---|---|---|---|---|---|---|
| SVM | 0.902 | 0.893 | 0.793 | 0.947 | 0.911 | 0.905 | 0.812 | 0.958 |
| DNN | 0.894 | 0.892 | 0.780 | 0.950 | 0.907 | 0.904 | 0.806 | 0.960 |
| XGBoost | 0.902 | 0.894 | 0.793 | 0.956 | 0.891 | 0.883 | 0.770 | 0.957 |
| SGB | 0.901 | 0.894 | 0.792 | 0.952 | 0.886 | 0.879 | 0.759 | 0.958 |
| RLR | 0.875 | 0.872 | 0.740 | 0.932 | 0.873 | 0.867 | 0.734 | 0.936 |
| k-NN | 0.863 | 0.862 | 0.717 | 0.913 | 0.857 | 0.856 | 0.705 | 0.917 |
| NB | 0.826 | 0.834 | 0.654 | 0.898 | 0.780 | 0.793 | 0.572 | 0.888 |
| Consensus1 | 0.902 | 0.893 | 0.793 | NA | 0.903 | 0.897 | 0.797 | NA |
| Consensus2 | 0.901 | 0.895 | 0.793 | 0.956 | 0.909 | 0.903 | 0.808 | 0.963 |
NA not available
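The GA, BA, and MCC columns can all be derived from a binary confusion matrix, as the sketch below shows. The counts in the example are hypothetical and are not reported in the source. They assume a test set of 219 inhibitors and 340 non-inhibitors (559 compounds) and were chosen only because the resulting metrics round to the SVM test-set values in the table; other counts would round to the same values.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Global accuracy, balanced accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ga = (tp + tn) / (tp + fn + tn + fp)
    ba = (sensitivity + specificity) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"GA": ga, "BA": ba, "MCC": mcc}


# Hypothetical counts (an assumption, not data from the source).
metrics = classification_metrics(tp=193, fn=26, tn=316, fp=24)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because the two classes are imbalanced, BA and MCC are more informative than GA alone, which is presumably why all three are reported.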
Fig. 6 The residual (binary cross entropy) distribution plots of a the fivefold cross-validated training set and b the test set for the four well-performing QSAR models (DNN, SGB, XGBoost, and SVM). The X axis represents the value of binary cross entropy, and the Y axis represents the ratio of the number of compounds with residuals higher than a given value to the total number of compounds in the training or test set
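The curve described in this caption can be reproduced in two steps: compute a binary cross entropy residual per compound, then report the fraction of compounds whose residual exceeds each threshold. The sketch below assumes labels of 0 or 1 and predicted probabilities for class 1; the function names and the toy values are illustrative, not from the source.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Per-compound binary cross entropy; p_pred is the predicted probability of class 1."""
    p = min(max(p_pred, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def fraction_above(residuals, threshold):
    """Fraction of compounds whose residual is strictly higher than the threshold."""
    return sum(r > threshold for r in residuals) / len(residuals)


# Toy labels and predicted probabilities (hypothetical).
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.7]
residuals = [binary_cross_entropy(y, p) for y, p in zip(labels, probabilities)]
for threshold in (0.1, 0.5, 1.0):
    print(threshold, fraction_above(residuals, threshold))
```

A model whose curve drops toward zero faster has fewer compounds with large residuals.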
Statistical results of the seven classification models for the cluster fivefold cross-validated training set
| Model | GA | BA | MCC | AUC |
|---|---|---|---|---|
| SVM | 0.910 | 0.901 | 0.810 | 0.954 |
| DNN | 0.905 | 0.897 | 0.794 | 0.949 |
| XGBoost | 0.905 | 0.897 | 0.800 | 0.959 |
| SGB | 0.901 | 0.892 | 0.791 | 0.956 |
| RLR | 0.879 | 0.876 | 0.748 | 0.935 |
| k-NN | 0.867 | 0.863 | 0.722 | 0.918 |
| NB | 0.819 | 0.823 | 0.634 | 0.896 |
The number of compounds that are inside and outside the AD determined by the non-parametric probability density distribution-based method in the training and test sets
| Dataset | Inside AD: N1^a | Inside AD: N2^b | Outside AD: N1^a | Outside AD: N2^b | AD coverage (%) |
|---|---|---|---|---|---|
| Training set | 879 | 1361 | 0 | 0 | 100 |
| Test set | 219 | 340 | 1 | 10 | 98 |
^a N1: the number of inhibitors
^b N2: the number of non-inhibitors
The reported classification models for BCRP inhibitors and non-inhibitors
| Year | Data size | Training set | Test set | Method | Descriptors | Model validation | Statistical results | Refs. |
|---|---|---|---|---|---|---|---|---|
| 2007 | 123 | 80 | 43 | OPLS-DA | Descriptors from SELMA software package | Y-rand | GA_TE = 0.79 | Matsson et al. [ |
| 2009 | 122 | 83 | 39 | PLS-DA | Descriptors from DragonX version 3.0 | Y-rand | NA^a | Matsson et al. [ |
| 2013 | 109 | 30 | 79 | Pharmacophore modeling | NA | NA | MCC_TE = 0.29, GA_TE = 0.66 | Pan et al. [ |
| 2013 | 203 | 124 | 79 | NB | ECFP_6, FCFP_6 fingerprints | LOO CV | AUC_TR (LOO CV) = 0.795, MCC_TE = 0.69 | Pan et al. [ |
| 2013 | 382 | 382 | NA | SVM, k-NN, RF, and consensus modeling | Dragon, MOE descriptors | Fivefold CV, Y-rand | BA_TR (fivefold CV) = 0.83 ± 0.04 (consensus) | Sedykh et al. [ |
| 2014 | 275 | 96 | Test: 32, external set: 147 | Ensembles of ANN, ensembles of SVM | Descriptors from ADMET Modeler | NA | GA_TE = 0.87, GA_External = 0.67 (ensembles of ANN) | Eric et al. [ |
| 2014 | 780 | 780 | NA | NB | ECFP_6 fingerprints | Tenfold CV | GA_TR (tenfold CV) = 0.919, AUC_TR (tenfold CV) = 0.854 | Montanari et al. [ |
| 2015 | 394 | 197 | Test: 99, external set: 98 | SVM, k-NN, ANN, and consensus modeling | Dragon descriptors | NA | GA_TE = 0.878, MCC_TE = 0.73; GA_External = 0.745, MCC_External = 0.46 (ANN) | Belekar et al. [ |
| 2016 | NA^a | NA | NA | GTM-kNNd, GTM-Bayes, RF, SVM, and k-NN | MOE descriptors | Fivefold CV with five repetitions | NA | Gimadiev et al. [ |
| 2017 | 978 | 978 | NA | NB, LR, SVM, and RF | MACCS, Morgan, ECFP8 fingerprints, VolSurf descriptors | Tenfold CV, leave-sources-out validation | MCC_TR (tenfold CV) = 0.65, AUC_TR (tenfold CV) = 0.90 (LR) | Montanari et al. [ |
| 2019 | 2799 | 2240 | 559 | NB, LR, SVM, k-NN, XGBoost, SGB, DNN and consensus modeling | MOE descriptors and PubChem fingerprints | Fivefold CV | MCC_TE = 0.812, AUC_TE = 0.958, GA_TE = 0.911, BA_TE = 0.905 (SVM) | This study |
Mean ± st.dev across fivefold CV
TR training set, TE test set, OPLS-DA orthogonal partial least-squares projection to latent structures discriminant analysis, NA not available, GA global accuracy, Y-Rand Y-Randomization test, PLS-DA partial least-squares projection to latent structures discriminant analysis, NB Naive Bayes, LOO CV leave-one-out cross-validation, AUC the area under the receiver operating characteristic curve, MCC Matthews correlation coefficient, SVM support vector machine, k-NN k-nearest neighbors, RF random forest, CV cross-validation, BA balanced accuracy, ANN artificial neural networks, GTM generative topographic mapping, LR logistic regression
There are many models developed based on different methods or descriptors, and we only extracted the best statistical results for the test set or cross-validation
^a The exact values are not available in the publication
Fig. 7 The top ten descriptors/fingerprints for the a DNN, b XGBoost, c SGB and d SVM models identified by the perturbation-based model-agnostic method with 10 repetitions. The term “full_model_” denotes the estimate of model performance on the test set based on the cross entropy loss function
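One common way to realise a perturbation-based, model-agnostic importance measure is to permute one feature at a time and record how much the cross entropy loss increases relative to the unperturbed model. The sketch below illustrates that scheme. Whether it matches the authors' exact procedure is an assumption, and `toy_model`, the toy data, and the function names are hypothetical.

```python
import math
import random


def mean_cross_entropy(model, X, y, eps=1e-12):
    """Mean binary cross entropy of model(row) predictions against labels y."""
    total = 0.0
    for row, label in zip(X, y):
        p = min(max(model(row), eps), 1 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Return (baseline loss, mean loss increase per feature after shuffling it)."""
    rng = random.Random(seed)
    baseline = mean_cross_entropy(model, X, y)  # loss of the unperturbed model
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            increases.append(mean_cross_entropy(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return baseline, importances


# Hypothetical model that only looks at feature 0; feature 1 is constant.
def toy_model(row):
    return 0.9 if row[0] == 1 else 0.1


X = [[0, 5], [1, 5], [0, 5], [1, 5], [0, 5], [1, 5]]
y = [0, 1, 0, 1, 0, 1]
baseline, importances = permutation_importance(toy_model, X, y)
print(baseline, importances)
```

Here the baseline plays the role of the unperturbed reference loss, and features are ranked by how far shuffling them pushes the loss above it.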
Fig. 8 The information gain (IG) value distributions of the generated fragments
The fragments identified by the information gain (IG) method coupled with frequency analysis for BCRP inhibitors
| Fragment | Name | Count1^a | Count2^b | F1^c | F2^d | IG |
|---|---|---|---|---|---|---|
|  | Quinazoline | 175 | 14 | 2.3604 | 0.1219 | 0.0665 |
|  |  | 92 | 0 | 2.5492 | 0.0000 | 0.0456 |
|  | Pyrazolo[1,5-a]pyrimidine | 85 | 8 | 2.3299 | 0.1415 | 0.0299 |
|  | 4H-chromene | 104 | 23 | 2.0875 | 0.2980 | 0.0263 |
|  | 2-phenyl-4H-chromene | 55 | 3 | 2.4173 | 0.0851 | 0.0216 |
|  | 2H-tetrazole | 65 | 12 | 2.1519 | 0.2564 | 0.0177 |
|  | Piperazine | 90 | 30 | 1.9119 | 0.4114 | 0.0171 |
|  |  | 31 | 0 | 2.5492 | 0.0000 | 0.0151 |
|  | 2,3,4,9-tetrahydro-1H-pyrido[3,4-b]indole | 29 | 0 | 2.5492 | 0.0000 | 0.0141 |
|  | (R)-2-benzyl-1-phenyl-2,3,4,9-tetrahydro-1H-pyrido[3,4-b]indole | 27 | 0 | 2.5492 | 0.0000 | 0.0131 |
|  |  | 41 | 5 | 2.2721 | 0.1789 | 0.0131 |
|  | Pyrido[2,3-d]pyrimidine | 32 | 2 | 2.3992 | 0.0968 | 0.0122 |
^a Count1 denotes the number of inhibitors containing the fragment
^b Count2 denotes the number of non-inhibitors containing the fragment
^c F1 denotes the fragment frequency in the inhibitor class
^d F2 denotes the fragment frequency in the non-inhibitor class
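The source does not give the formulas behind the F1, F2, and IG columns. The sketch below uses definitions that are assumptions, but that reproduce the tabulated values for the rows checked (for example quinazoline: 175 inhibitors, 14 non-inhibitors). The frequency is taken as the class share among compounds containing the fragment divided by the class share in the whole dataset, and IG as the reduction in base-2 Shannon entropy of the class label when the dataset is split on fragment presence. The dataset sizes (1098 inhibitors, 1701 non-inhibitors) come from the abstract; the function names are illustrative.

```python
import math


def entropy(pos, neg):
    """Base-2 Shannon entropy of a two-class count pair."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def fragment_stats(count_inh, count_non, n_inh=1098, n_non=1701):
    """Return (F1, F2, IG) for a fragment under the assumed definitions."""
    n_total = n_inh + n_non
    n_frag = count_inh + count_non
    # Enrichment of each class among fragment-containing compounds.
    f_inh = (count_inh / n_frag) / (n_inh / n_total)
    f_non = (count_non / n_frag) / (n_non / n_total)
    # Information gain of splitting the dataset on fragment presence.
    h_with = entropy(count_inh, count_non)
    h_without = entropy(n_inh - count_inh, n_non - count_non)
    ig = (entropy(n_inh, n_non)
          - (n_frag / n_total) * h_with
          - ((n_total - n_frag) / n_total) * h_without)
    return f_inh, f_non, ig


# Quinazoline row: 175 inhibitors and 14 non-inhibitors contain the fragment.
print([round(value, 4) for value in fragment_stats(175, 14)])
```

Under these definitions a frequency above 1 means the fragment is over-represented in that class, and the maximum F1 of about 2.5492 is reached when a fragment occurs only in inhibitors.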
The fragments identified by the information gain (IG) method coupled with frequency analysis for BCRP non-inhibitors
| Fragment | Name | Count1^a | Count2^b | F1^c | F2^d | IG |
|---|---|---|---|---|---|---|
|  | 1H-pyrazole | 1 | 67 | 0.0375 | 1.6213 | 0.0153 |
|  | Thiazolidine | 0 | 56 | 0.0000 | 1.6455 | 0.0146 |
|  | Thiophene | 16 | 118 | 0.3044 | 1.4490 | 0.0133 |
|  | Morpholine | 0 | 51 | 0.0000 | 1.6455 | 0.0132 |
|  | 4H-1,2,4-triazole | 0 | 48 | 0.0000 | 1.6455 | 0.0125 |
|  | Thiazole | 3 | 62 | 0.1177 | 1.5696 | 0.0113 |
|  | Benzo[d]thiazole | 1 | 50 | 0.0500 | 1.6132 | 0.0109 |
|  | 1,3,4-thiadiazole | 0 | 36 | 0.0000 | 1.6455 | 0.0093 |
|  | Hexahydropyrimidine | 2 | 44 | 0.1108 | 1.5740 | 0.0081 |
|  | Piperidine | 3 | 46 | 0.1561 | 1.5448 | 0.0075 |
|  | Pyrrolidine | 1 | 36 | 0.0689 | 1.6010 | 0.0074 |
^a Count1 denotes the number of inhibitors containing the fragment
^b Count2 denotes the number of non-inhibitors containing the fragment
^c F1 denotes the fragment frequency in the inhibitor class
^d F2 denotes the fragment frequency in the non-inhibitor class
The representative fragments whose counts are equal to or larger than 2 among the 50 misclassified compounds
| Fragment | Name | Count1^a | Count2^a | Count3^a | OR1^a (%) | OR2^a (%) | OR3^a (%) |
|---|---|---|---|---|---|---|---|
|  | Pyridine | 125 | 43 | 5 | 5.58 | 7.69 | 10.00 |
|  | Tetrahydro-2H-pyran | 37 | 6 | 3 | 1.65 | 1.07 | 6.00 |
|  | Adamantane | 6 | 3 | 2 | 0.27 | 0.54 | 4.00 |
|  | (E)-prop-1-ene-1,3-diyldibenzene | 58 | 16 | 3 | 2.59 | 2.86 | 6.00 |
|  | 1H-indole | 82 | 25 | 3 | 3.66 | 4.47 | 6.00 |
|  | 4H-chromene | 110 | 17 | 4 | 4.91 | 3.04 | 8.00 |
|  | Furan | 175 | 37 | 2 | 7.81 | 6.62 | 4.00 |
|  | Piperazine | 98 | 22 | 3 | 4.38 | 3.94 | 6.00 |
|  | Quinazoline | 150 | 39 | 2 | 6.70 | 6.98 | 4.00 |
|  | Quinoline | 58 | 16 | 3 | 2.59 | 2.86 | 6.00 |
|  | Thiophene | 109 | 25 | 3 | 4.87 | 4.47 | 6.00 |
|  | 1,2,3,4-tetrahydroi | 56 | 14 | 3 | 2.50 | 2.50 | 6.00 |
|  | 2-(4-((quinolin-3-ylmethyl)amino)phenethyl)-1,2,3,4-tetrahydroi | 0 | 2 | 2 | 0.00 | 0.36 | 4.00 |
^a Count1, Count2, and Count3 represent the numbers of training, test, and misclassified compounds containing the fragment, respectively; OR1, OR2, and OR3 represent the occurrence ratios of the fragment in the training, test, and misclassified compounds, respectively
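The occurrence ratios appear to be simple percentages of each subset. The sketch below assumes denominators of 2240 training compounds, 559 test compounds, and 50 misclassified compounds, which are the set sizes stated elsewhere in this record; the names `occurrence_ratio` and `SUBSET_SIZES` are illustrative.

```python
def occurrence_ratio(count, total):
    """Percentage of compounds in a subset that contain the fragment."""
    return 100.0 * count / total


# Assumed subset sizes: training set, test set, misclassified compounds.
SUBSET_SIZES = {"training": 2240, "test": 559, "misclassified": 50}

# Pyridine row of the table above.
pyridine_counts = {"training": 125, "test": 43, "misclassified": 5}
for subset, count in pyridine_counts.items():
    print(subset, round(occurrence_ratio(count, SUBSET_SIZES[subset]), 2))
```

A fragment whose ratio among misclassified compounds is clearly higher than in the training and test sets, such as tetrahydro-2H-pyran or adamantane, is over-represented among the errors.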