Xiangeng Wang1,2, Xiaolei Zhu3, Mingzhi Ye1, Yanjing Wang1, Cheng-Dong Li1, Yi Xiong1, Dong-Qing Wei1,2.
Abstract
Membrane transport proteins play crucial roles in the pharmacokinetics of substrate drugs and in cancer drug resistance, and are vital to drug discovery, drug development, and anti-cancer therapeutics. However, experimental methods that profile a substrate drug against a panel of transporters to determine its specificity are labor-intensive and time-consuming. In this article, we develop an in silico multi-label classification approach to predict whether a substrate is specifically recognized by one or more of 13 categories of drug transporters, ranging from ATP-binding cassette to solute carrier families, using both structural fingerprints and chemical ontology information of the substrates. The data-driven network-based label space partition (NLSP) method was used to construct the model on hybrid similarity-based features that integrate 2D fingerprint similarity and semantic similarity. This method builds a predictor for each label cluster (clusters may intersect) detected by community detection algorithms and takes the union of the predicted label sets for a compound as the final prediction. NLSP belongs to the ensemble category of multi-label classifiers in the multi-label learning field. We used Cramér's V statistic to quantify label correlations and depicted them in a heatmap. Jackknife tests and iterative-stratification-based cross-validation were performed on a benchmark dataset to evaluate the prediction performance of the proposed models in both a multi-label and a label-wise manner. Compared with other powerful multi-label methods (ML-kNN, MTSVM, and RAkELd), our multi-label classification model NLSP-RF (random forest-based NLSP) proved to be feasible and effective, and performed satisfactorily in predicting transporter-substrate specificity. The idea behind the NLSP method is intriguing, and its power for multi-label learning problems in bioinformatics remains to be explored.
The benchmark dataset, intermediate results, and Python code that fully reproduce our experiments and results are available at https://github.com/dqwei-lab/STS.
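The NLSP procedure described in the abstract (cluster the labels, train one predictor per cluster, take the union of the predicted label sets) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: connected components of the label co-occurrence graph stand in for community detection, a toy 1-nearest-neighbour learner stands in for the random forest, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def label_clusters(Y, min_cooccurrence=1):
    """Group label indices into clusters of the label co-occurrence graph.

    Stand-in (assumption): connected components replace the community
    detection algorithms used by NLSP, so clusters here never overlap.
    """
    n_labels = len(Y[0])
    counts = defaultdict(int)
    for row in Y:
        active = [j for j, v in enumerate(row) if v]
        for a, b in combinations(active, 2):
            counts[(a, b)] += 1
    neighbours = {j: set() for j in range(n_labels)}
    for (a, b), c in counts.items():
        if c >= min_cooccurrence:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in range(n_labels):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in neighbours[node] - seen:
                seen.add(nb)
                stack.append(nb)
        clusters.append(sorted(component))
    return clusters


def fit_1nn(X, Y_sub):
    """Toy base learner (assumption): copy the labels of the nearest sample."""
    def predict(x):
        nearest = min(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
        )
        return Y_sub[nearest]
    return predict


def train_nlsp(X, Y, clusters, fit=fit_1nn):
    """Train one predictor per label cluster on that cluster's label columns."""
    models = []
    for cluster in clusters:
        Y_sub = [tuple(row[j] for j in cluster) for row in Y]
        models.append((cluster, fit(X, Y_sub)))
    return models


def predict_nlsp(models, x, n_labels):
    """Final prediction: union of the labels predicted by every cluster model."""
    out = [0] * n_labels
    for cluster, predict in models:
        for j, v in zip(cluster, predict(x)):
            out[j] = max(out[j], v)
    return out


# Toy data: 4 compounds, 2 features, 3 transporter labels.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
Y = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
clusters = label_clusters(Y)
models = train_nlsp(X, Y, clusters)
prediction = predict_nlsp(models, [1.0, 0.0], n_labels=3)
```

Taking the element-wise maximum in `predict_nlsp` implements the union, so the same code would also work if a community detection method returned overlapping clusters.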
Keywords: chemical ontology; membrane transporter; multi-label classification; structural fingerprint; substrate specificity
Year: 2019 PMID: 31781551 PMCID: PMC6851049 DOI: 10.3389/fbioe.2019.00306
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Major steps in the article. Substrates, which were confirmed to be structurally diverse, were featurized into numerical vectors, combined with the corresponding transporter multi-label vectors, and then fed into different multi-label learning models. Label correlation analysis provided insights into the interactions among transporters. To assist researchers working on a specific membrane transporter, NLSP-RF, which showed consistently better multi-label performance metrics, was selected after the multi-label model comparison for the transporter-wise (single-label) analysis. For a more detailed description, refer to the subsequent parts of this article.
The average SS of all pairs of substrates on the benchmark dataset for the four types of fingerprints.
| Fingerprint | Average SS |
| FP2 | 0.1857 |
| FP3 | 0.4449 |
| FP4 | 0.2880 |
| MACCS | 0.3742 |
| Average | 0.3232 |
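As an illustration of how an average over all substrate pairs might be obtained, the sketch below computes the mean pairwise similarity of a set of fingerprints. It assumes that SS is a Tanimoto-style coefficient on fingerprint bit sets, which this section does not state; the function names and the toy fingerprints are hypothetical.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_pairwise_similarity(fingerprints):
    """Mean similarity over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints (assumption): sets of on-bit indices for three compounds.
toy_fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
toy_average = average_pairwise_similarity(toy_fingerprints)
```

A low average, as in the table above, indicates that most pairs of substrates share few fingerprint bits.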
Figure 2. Label correlation landscape. (A) Pair-wise heatmap of Cramér's V statistics. (B) UpSet visualization of label intersections. The horizontal bars show the number of substrates per transporter, and the vertical bars show the number of substrates per transporter category intersection. A filled dot denotes a transporter whose exclusive substrates are counted in the corresponding vertical bar. The vertical lines stand for the intersection of the substrates of specific transporters. The more dots a line encompasses, the more intersections are considered when tallying the corresponding vertical bar.
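Panel (A) is built from pairwise Cramér's V values between transporter labels. A minimal sketch of how such a matrix could be computed from a binary label matrix is given below. It assumes the plain chi-squared form of Cramér's V without bias or continuity correction, which may differ from the variant used for the figure; the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two categorical vectors (no bias correction)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    joint = Counter(zip(x, y))
    row_totals, col_totals = Counter(x), Counter(y)
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant vector carries no association
    chi2 = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def cramers_v_matrix(Y):
    """Pairwise Cramér's V between the label columns of a 0/1 matrix."""
    columns = list(zip(*Y))
    return [[cramers_v(a, b) for b in columns] for a in columns]


# Toy label matrix (assumption): labels 0 and 1 always co-occur,
# label 2 is independent of both.
Y_toy = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
V = cramers_v_matrix(Y_toy)
```

For two binary labels the statistic reduces to the absolute value of the phi coefficient, so values near 1 mark transporters that share most of their substrates.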
Performance comparison of various multi-label classification methods.
| ML-kNN | 0.0617 | 73.14% | 72.19% | 69.01% | 63.16% |
| MTSVM | 0.0896 | 41.67% | 54.00% | 39.80% | 27.63% |
| RA | 0.1081 | 52.49% | 67.57% | 50.30% | 34.62% |
| RA | 0.0556 | 72.75% | 70.74% | 68.92% | 64.57% |
| RA | 0.0513 | 75.89% | 72.87% | 71.33% | 66.79% |
| NLSP-XGB | 0.0513 | 77.30% | 73.77% | 72.70% | 68.58% |
| NLSP-LGB | 0.0527 | 76.86% | 73.21% | 72.09% | 67.88% |
| NLSP-RF | |||||
| NLSP-EXT | 0.0530 | 77.00% | 73.82% | 72.49% | 68.20% |
The bold value stands for the best value of specific metrics in these models.
Figure 3. Comparison of feature importance between structural similarity- and semantic similarity-based features. “FP,” fingerprint, stands for structural similarity-based features. “OT,” ontology, stands for semantic similarity-based features. ****p < 0.0001.
Label-wise analysis of best-performing multi-label learning model.
| ABCG2 | 0.8689 | 0.7221 | 0.4847 | 0.6034 | 0.5769 | 0.8908 | 10 ×10-fold st CV |
| MDR1 | 0.8263 | 0.7796 | 0.8422 | 0.8371 | 0.9243 | 10 ×10-fold st CV | |
| MRP1 | 0.9521 | 0.8394 | 0.4445 | 0.6419 | 0.5753 | 0.9057 | 10 ×10-fold st CV |
| MRP2 | 0.9353 | 0.7221 | 0.2541 | 0.4881 | 0.3602 | 0.9133 | 10 ×10-fold st CV |
| MRP3 | 0.9705 | 0.5975 | 0.3107 | 0.4541 | 0.3885 | 0.8975 | 10 ×10-fold st CV |
| MRP4 | 0.9748 | 0.3667 | 0.1670 | 0.2668 | 0.2174 | 0.9341 | 10 ×10-fold st CV |
| NTCP2 | 0.8667 | 0.8958 | 0.8909 | 10 ×10-fold st CV | |||
| S15A1 | 0.9743 | 0.9174 | 0.8770 | 0.9808 | 10 ×10-fold st CV | ||
| S22A1 | 0.9651 | 0.9194 | 0.6096 | 0.7645 | 0.7304 | 0.9422 | 10 ×10-fold st CV |
| SO1A2 | 0.9732 | 0.4967 | 0.1333 | 0.3150 | 0.2037 | 0.8676 | 10 ×10-fold st CV |
| SO1B1 | 0.9562 | 0.5190 | 0.1410 | 0.330 | 0.2152 | 0.8964 | 10 ×10-fold st CV |
| ABCG2 | 0.76 | 0.756 | 0.764 | 0.76 | 0.77 | Not available | 5-fold cv |
| MDR1 | 0.776 | 0.798 | 0.751 | 0.775 | 0.761 | 5-fold cv | |
| MRP1 | 0.826 | 0.844 | 0.812 | 0.828 | 0.841 | 5-fold cv | |
| MRP2 | 0.814 | 0.886 | 0.746 | 0.816 | 0.804 | 5-fold cv | |
| MRP3 | 0.869 | 0.855 | 0.885 | 0.87 | 0.868 | 5-fold cv | |
| MRP4 | 0.905 | 0.857 | 0.949 | 0.903 | 0.914 | 5-fold cv | |
| NTCP2 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 5-fold cv | |
| S15A1 | 0.847 | 0.819 | 0.869 | 0.844 | 0.864 | 5-fold cv | |
| S22A1 | 0.844 | 0.875 | 0.813 | 0.844 | 0.84 | 5-fold cv | |
| SO1A2 | 0.711 | 0.979 | 0.419 | 0.699 | 0.581 | 5-fold cv | |
| SO1B1 | 0.776 | 0.726 | 0.829 | 0.777 | 0.784 | 5-fold cv |
The bold value stands for the best value of specific metrics in the model of NLSP-RF.
5-fold cv results are from Shaikh et al.
Methodological differences between Shaikh's method and our present method (STS-NLSP).
| Aspect | Shaikh's method | Present method (STS-NLSP) |
| Learning framework | Single-label learning | Multi-label learning |
| Machine learning method | SVM, random forest, etc. | NLSP |
| Dataset distribution | A balanced number of substrates and non-substrates for each single transporter | Substrates categorized into 13 transporters with an imbalanced distribution (910 substrates for the majority transporter MDR1 and 39 substrates for the minority transporter SO2B1) |
| Features | Molecular descriptors, molecular fingerprints, and sequence-based descriptors for transporter proteins | Average similarity score fingerprints and semantic similarity |
| Evaluation metrics | Recall, Specificity, Precision, Accuracy, F1 score, MCC | Aiming, Coverage, Accuracy, Absolute True, Absolute False |
| Validation method | Five-fold cross validation and independent test using an unseen external set | Jackknife test |
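The five multi-label metrics named in the table above (Aiming, Coverage, Accuracy, Absolute True, Absolute False) can be sketched as below. The formulas follow the set-based definitions commonly used for these metric names and are an assumption here, since this section does not spell them out; the handling of empty label sets and all identifiers are likewise hypothetical.

```python
def multilabel_metrics(true_sets, pred_sets, n_labels):
    """Set-based multi-label metrics averaged over samples.

    true_sets / pred_sets: one collection of label ids per sample.
    n_labels: total number of labels (13 in this article).
    """
    n = len(true_sets)
    if n == 0 or n != len(pred_sets):
        raise ValueError("inputs must be non-empty and of equal length")
    totals = dict.fromkeys(
        ["aiming", "coverage", "accuracy", "absolute_true", "absolute_false"], 0.0
    )
    for y, p in zip(true_sets, pred_sets):
        y, p = set(y), set(p)
        inter, union = len(y & p), len(y | p)
        # Aiming: fraction of predicted labels that are correct.
        totals["aiming"] += inter / len(p) if p else 0.0
        # Coverage: fraction of true labels that were predicted.
        totals["coverage"] += inter / len(y) if y else 0.0
        # Accuracy: overlap of the two label sets (1.0 if both are empty).
        totals["accuracy"] += inter / union if union else 1.0
        # Absolute true: predicted set identical to the true set.
        totals["absolute_true"] += 1.0 if y == p else 0.0
        # Absolute false: share of labels on which the two sets disagree.
        totals["absolute_false"] += (union - inter) / n_labels
    return {name: total / n for name, total in totals.items()}


# Toy example (assumption): two compounds, three possible labels.
scores = multilabel_metrics([{0, 1}, {2}], [{0}, {2}], n_labels=3)
```

Higher is better for the first four metrics, whereas Absolute False is an error rate, so lower is better.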
Anatomy of the benchmark dataset 𝕊 according to the 13 classes of transporter substrates (see Equation 1). See Supporting Information for further explanation.
| Subset | Transporter | Description | Number of substrates |
| 𝕊1 | ABCG2 | ATP-binding cassette subfamily G member 2 (BCRP) | 344 |
| 𝕊2 | MDR1 | Multidrug resistance protein 1 (P-glycoprotein 1) | 910 |
| 𝕊3 | MRP1 | Multidrug resistance-associated protein 1 | 138 |
| 𝕊4 | MRP2 | Multidrug resistance-associated protein 2 | 136 |
| 𝕊5 | MRP3 | Multidrug resistance-associated protein 3 | 63 |
| 𝕊6 | MRP4 | Multidrug resistance-associated protein 4 | 47 |
| 𝕊7 | NTCP2 | Sodium/taurocholate cotransporter | 53 |
| 𝕊8 | S15A1 | Solute carrier family 15 member 1 (peptide transporter 1) | 230 |
| 𝕊9 | S22A1 | Solute carrier family 22 member 1 (organic cation transporter 1) | 144 |
| 𝕊10 | SO1A2 | Solute carrier organic anion transporter family member 1A2 | 54 |
| 𝕊11 | SO1B1 | Solute carrier organic anion transporter family member 1B1 | 87 |
| 𝕊12 | SO1B3 | Solute carrier organic anion transporter family member 1B3 | 48 |
| 𝕊13 | SO2B1 | Solute carrier organic anion transporter family member 2B1 | 39 |
| Number of total virtual substrates | 2,293 | | |
| Number of total structurally different substrates | 1,846 | | |
The number of virtual substrates is calculated as follows: a structurally identical substrate contributes 2 to the total number of virtual substrates if it occurs in two different classes of transporter substrates, 3 if it occurs in three different classes, and so forth.
Of the 1,846 structurally different substrates, 1,591 belong to one class, 145 to two classes, 62 to three classes, 28 to four classes, 12 to five classes, 4 to six classes, 3 to seven classes, and 1 to nine classes.
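The two totals can be cross-checked directly from the figures quoted above. The short script below only restates numbers from the table and the preceding sentence; the variable names are hypothetical.

```python
# Substrates per transporter class, as listed in the table.
class_sizes = {
    "ABCG2": 344, "MDR1": 910, "MRP1": 138, "MRP2": 136, "MRP3": 63,
    "MRP4": 47, "NTCP2": 53, "S15A1": 230, "S22A1": 144, "SO1A2": 54,
    "SO1B1": 87, "SO1B3": 48, "SO2B1": 39,
}

# Number of distinct substrates belonging to exactly k classes.
substrates_by_class_count = {1: 1591, 2: 145, 3: 62, 4: 28, 5: 12, 6: 4, 7: 3, 9: 1}

# Each distinct substrate is counted once.
distinct_substrates = sum(substrates_by_class_count.values())

# A substrate in k classes contributes k virtual substrates.
virtual_substrates = sum(k * n for k, n in substrates_by_class_count.items())

# The per-class sizes must add up to the same virtual total.
total_from_classes = sum(class_sizes.values())
```

Both routes give 2,293 virtual substrates, and the class-membership counts add up to the 1,846 distinct substrates.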
Iterative Stratification (D, n, r1, …, rn)
| Generate the required number of samples at each fold |
| Generate the required number of samples of each label at each fold |
| Calculate the samples of each label in the initial set |
| Identify the label with the fewest (but at least one) remaining samples, breaking ties randomly |
| Identify the fold(s) with the largest number of required samples for this label; break ties by considering the largest number of required samples overall, and break further ties randomly |
| Update the desired number of examples |
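The steps listed above can be sketched in Python as follows. This is a minimal sketch of the generic iterative stratification procedure for multi-label data, written to match the surviving step descriptions; the statements of the original listing are not recoverable here, so the function name, argument names, seed handling, and the treatment of unlabeled samples are all assumptions.

```python
import random


def iterative_stratification(label_sets, proportions, seed=0):
    """Split sample indices into folds while balancing every label.

    label_sets: one set of labels per sample (labels must be sortable).
    proportions: desired fraction of samples per fold, e.g. [0.5, 0.5].
    """
    rng = random.Random(seed)
    n, k = len(label_sets), len(proportions)
    # Required number of samples at each fold.
    need = [n * r for r in proportions]
    # Samples of each label in the initial set.
    labels = sorted(set().union(*label_sets))
    remaining = {
        lab: {i for i, s in enumerate(label_sets) if lab in s} for lab in labels
    }
    # Required number of samples of each label at each fold.
    need_label = {
        lab: [len(remaining[lab]) * r for r in proportions] for lab in labels
    }
    folds = [[] for _ in range(k)]
    unassigned = set(range(n))
    while unassigned:
        live = {lab: idx for lab, idx in remaining.items() if idx}
        if not live:
            # Assumption: samples without labels go to the neediest fold.
            for i in sorted(unassigned):
                best = max(need)
                m = rng.choice([j for j in range(k) if need[j] == best])
                folds[m].append(i)
                need[m] -= 1
            break
        # Label with the fewest (but at least one) remaining samples.
        fewest = min(len(idx) for idx in live.values())
        lab = rng.choice(sorted(l for l, idx in live.items() if len(idx) == fewest))
        for i in sorted(remaining[lab]):
            # Fold(s) that still need the most samples of this label.
            best = max(need_label[lab])
            candidates = [j for j in range(k) if need_label[lab][j] == best]
            if len(candidates) > 1:
                # Tie: prefer the fold that needs the most samples overall.
                best_total = max(need[j] for j in candidates)
                candidates = [j for j in candidates if need[j] == best_total]
            m = rng.choice(candidates)  # remaining ties are broken randomly
            folds[m].append(i)
            unassigned.discard(i)
            # Update the desired number of examples.
            for other in label_sets[i]:
                remaining[other].discard(i)
                need_label[other][m] -= 1
            need[m] -= 1
    return folds


# Toy data (assumption): 10 samples with label "a" and 10 with label "b".
toy_labels = [{"a"}] * 10 + [{"b"}] * 10
toy_folds = iterative_stratification(toy_labels, [0.5, 0.5], seed=1)
```

Processing the rarest label first is the key design choice: rare labels, such as the 39-substrate SO2B1 class in this dataset, are distributed before the common ones can exhaust the capacity of any fold.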