Shuai-Bing He, Man-Man Li, Bai-Xia Zhang, Xiao-Tong Ye, Ran-Feng Du, Yun Wang, Yan-Jiang Qiao.
Abstract
During the past decades, there have been continuous attempts at predicting the metabolism mediated by cytochrome P450s (CYP450s) 3A4, 2D6, and 2C9. However, it remains a considerable challenge to accurately predict the metabolism of xenobiotics mediated by these enzymes. To address this issue, the microsomal metabolic reaction system (MMRS), a novel concept that integrates information about the site of metabolism (SOM) and the enzyme, was introduced. By combining multiple feature selection (FS) techniques (ChiSquared (CHI), InfoGain (IG), GainRatio (GR), and Relief) with hybrid classification procedures (KStar, Bayes (BN), K-nearest neighbours (IBK), C4.5 decision tree (J48), RandomForest (RF), support vector machines (SVM), AdaBoostM1, and Bagging), metabolism prediction models were established based on metabolism data released by Sheridan et al. Four major biotransformations were involved: aliphatic C-hydroxylation, aromatic C-hydroxylation, N-dealkylation, and O-dealkylation. In validation, the overall accuracies of all four biotransformation models exceeded 0.95. In receiver operating characteristic (ROC) analysis, each of these models gave a significant area under curve (AUC) value >0.98. In addition, an external test was performed on a previously published dataset, in which 87.7% of the potential SOMs were correctly identified by our four models. In summary, four MMRS-based models were established that can predict the metabolism mediated by CYP3A4, 2D6, and 2C9 with high accuracy.
Keywords: CYP2C9; CYP2D6; CYP3A4; classification; feature selection; metabolism prediction; microsomal metabolic reaction system
Year: 2016 PMID: 27735849 PMCID: PMC5085718 DOI: 10.3390/ijms17101686
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Four series of models for N-dealkylation. Here, CHI, GR, IG, and Relief denote models established with the CHI, GR, IG, and Relief FS techniques, respectively. The plots compare different combinations of FS algorithms and classification algorithms: both the accuracy (ACC) and the area under curve (AUC) derived from 10-fold cross-validation are plotted as functions of the number of selected features. Different classification methods are distinguished by color.
Optimal model selection results.
| Reaction Type | Classifier | FS | No. of Features | Scheme |
|---|---|---|---|---|
| I | AdaBoostM1 | GR | 31 | 1–10,13,17,18,21,22,25–33,35,36,45,46,51,54,55 |
| II | RandomForest | none | 56 | 1–56 |
| III | RandomForest | CHI | 39 | 1–33,38,45,46,51,54,55 |
| IV | AdaBoostM1 | none | 56 | 1–56 |
ID in scheme corresponds to feature ID in Table 6.
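The "Scheme" column encodes each model's feature subset as comma-separated feature IDs and ID ranges. As a purely illustrative helper (not part of the authors' workflow), such a scheme string can be expanded into an explicit list of IDs:

```python
def expand_scheme(scheme):
    """Expand a scheme string such as '1-10,13,17' into a list of feature IDs."""
    ids = []
    # Normalize en-dashes, drop any trailing comma, then split into parts.
    for part in scheme.replace("\u2013", "-").rstrip(",").split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids

# The GR scheme of model I expands to the 31 features reported above.
scheme_I = "1-10,13,17,18,21,22,25-33,35,36,45,46,51,54,55"
assert len(expand_scheme(scheme_I)) == 31
```

This confirms that the "No. of Features" column is consistent with the listed scheme.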
Prediction quality of the four optimal models, in terms of different statistical indices.
| Data Set | Reaction Type | SE | SP | ACC | BACC | AUC |
|---|---|---|---|---|---|---|
| Training set | I | 0.956 | 0.983 | 0.976 | 0.970 | 0.984 |
| | II | 0.953 | 0.921 | 0.928 | 0.937 | 0.950 |
| | III | 0.972 | 0.965 | 0.967 | 0.969 | 0.984 |
| | IV | 0.978 | 0.987 | 0.984 | 0.983 | 0.993 |
| Test set | I | 0.958 | 0.989 | 0.981 | 0.974 | 0.995 |
| | II | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | III | 0.985 | 0.939 | 0.950 | 0.962 | 0.984 |
| | IV | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
SE, SP, ACC, BACC, and AUC represent the sensitivity, specificity, accuracy, balanced accuracy and area under curve, respectively.
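The threshold-based indices in the table follow from the confusion-matrix counts. A minimal sketch of their standard definitions (AUC is omitted, since it requires ranking scores rather than counts):

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute SE, SP, ACC, and BACC from confusion-matrix counts."""
    se = tp / (tp + fn)                   # sensitivity: recall on positive SOMs
    sp = tn / (tn + fp)                   # specificity: recall on negative SOMs
    acc = (tp + tn) / (tp + fn + tn + fp) # overall accuracy
    bacc = (se + sp) / 2                  # balanced accuracy: mean of SE and SP
    return se, sp, acc, bacc

# Hypothetical counts, for illustration only:
se, sp, acc, bacc = classification_metrics(tp=95, fn=5, tn=90, fp=10)
```

Balanced accuracy is the more informative index here because, as the dataset tables below show, negative SOMs outnumber positive ones in every reaction type.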
Figure 2. A comparison among our models (A); rule-based models (B); and other ligand-based and structure-based models (C). A1, A2, and A3 represent the sites of metabolism (SOMs) of aliphatic C-hydroxylation; B1, B2, and B3 the SOMs of N-dealkylation; C1, C2, and C3 the SOMs of aromatic C-hydroxylation; D1, D2, and D3 the SOMs of O-dealkylation; and E1, E2, and E3 the SOMs of other types of biotransformation.
Figure 3. Example molecules in the external set that were well predicted and poorly predicted by the full models established in this work: (A) CYP450 3A4; (B) CYP450 2D6; (C) CYP450 2C9. Experimentally observed SOMs are indicated with gray solid circles. SOMs predicted with the microsomal metabolic reaction system (MMRS)-based prediction models are marked with hollow arrows, with the designated biotransformations also recorded. Biotransformations marked in light gray indicate SOMs that were incorrectly identified; black indicates SOMs that were correctly identified.
Description of the datasets collected from the literature.
| Reaction Type | No. of Positive SOM | Percentage (%) |
|---|---|---|
| Aliphatic C-hydroxylation | 1411 | 61 |
| Aromatic C-hydroxylation | 314 | 13.6 |
| N-dealkylation | 347 | 15 |
| O-dealkylation | 137 | 5.9 |
| | 57 | 2.5 |
| | 27 | 1.2 |
| Desulfurization | 7 | 0.3 |
| Others | 15 | 0.6 |
Visual graphic definitions of four biotransformations mediated by CYP450 3A4, 2D6, and 2C9.
| ID | Biotransformation | Definition |
|---|---|---|
| I | Aliphatic C-hydroxylation | |
| II | Aromatic C-hydroxylation | |
| III | N-dealkylation | |
| IV | O-dealkylation | |
The potential SOM of each biotransformation is marked with a solid circle in a different color.
Descriptive statistics of MMRS datasets used for modeling.
| Reaction Type | MMRS Pattern | Unlabeled | Negative | Positive |
|---|---|---|---|---|
| I | Aliphatic C–H-enzyme | 4233 | 4091 | 1411 |
| II | Aromatic C–H-enzyme | 942 | 828 | 314 |
| III | C–N-enzyme | 1041 | 961 | 347 |
| IV | C–O-enzyme | 411 | 374 | 137 |
Unlabeled, Negative, and Positive represent unlabeled MMRS, negative MMRS, and positive MMRS, respectively.
Descriptor definitions.
| ID | Label | Description |
|---|---|---|
| Physicochemical descriptors | ||
| 1 | qtot_A | total charge of A |
| 2 | qsigma_A | σ charge of A |
| 3 | qpi_A | π charge of A |
| 4 | pol_A | polarizability of A |
| 5 | oensigma_A | σ orbital-electronegativity of A |
| 6 | oenpi_A | π orbital-electronegativity of A |
| 7 | pichgdens_A | π charge density of A |
| 8 | totchgdens_A | total charge density of A |
| 9 | qtot_B | total charge of B |
| 10 | qsigma_B | σ charge of B |
| 11 | qpi_B | π charge of B |
| 12 | pol_B | polarizability of B |
| 13 | oensigma_B | σ orbital-electronegativity of B |
| 14 | oenpi_B | π orbital-electronegativity of B |
| 15 | pichgdens_B | π charge density of B |
| 16 | totchgdens_B | total charge density of B |
| 17 | maxqtot_A, (Except B) | maximum charge of A neighbors |
| 18 | minqtot_A, (Except B) | minimum charge of A neighbors |
| 19 | maxqtot_B, (Except A) | maximum charge of B neighbors |
| 20 | minqtot_B, (Except A) | minimum charge of B neighbors |
| 21 | maxpol_A, (Except B) | maximum polarizability of A neighbors |
| 22 | minpol_A, (Except B) | minimum polarizability of A neighbors |
| 23 | maxpol_B, (Except A) | maximum polarizability of B neighbors |
| 24 | minpol_B, (Except A) | minimum polarizability of B neighbors |
| 25 | dqtot | difference of total charges |
| 26 | dqsigma | difference of σ charges |
| 27 | dqpi | difference of π charges |
| 28 | doensigma | difference of σ orbital-electronegativity |
| 29 | doenpi | difference of π orbital-electronegativity |
| Topological descriptors | ||
| 30 | Hn_A, (Except B) | number of H-atoms bonded to A |
| 31 | Cn_A, (Except B) | number of C-atoms bonded to A |
| 32 | Nn_A, (Except B) | number of N-atoms bonded to A |
| 33 | On_A, (Except B) | number of O-atoms bonded to A |
| 34 | Pn_A, (Except B) | number of P-atoms bonded to A |
| 35 | Sn_A, (Except B) | number of S-atoms bonded to A |
| 36 | Xn_A, (Except B) | number of halogen atoms bonded to A |
| 37 | Hn_B, (Except A) | number of H-atoms bonded to B |
| 38 | Cn_B, (Except A) | number of C-atoms bonded to B |
| 39 | Nn_B, (Except A) | number of N-atoms bonded to B |
| 40 | On_B, (Except A) | number of O-atoms bonded to B |
| 41 | Pn_B, (Except A) | number of P-atoms bonded to B |
| 42 | Sn_B, (Except A) | number of S-atoms bonded to B |
| 43 | Xn_B, (Except A) | number of halogen atoms bonded to B |
| 44 | Csp1_A | A is C sp1 |
| 45 | Csp2_A | A is C sp2 |
| 46 | Csp3_A | A is C sp3 |
| 47 | Csp1_B | B is C sp1 |
| 48 | Csp2_B | B is C sp2 |
| 49 | Csp3_B | B is C sp3 |
| 50 | Csp1_neig A | number of C sp1 neighbors of A |
| 51 | Csp2_neig A | number of C sp2 neighbors of A |
| 52 | Csp3_neig A | number of C sp3 neighbors of A |
| 53 | Csp1_neig B | number of C sp1 neighbors of B |
| 54 | Csp2_neig B | number of C sp2 neighbors of B |
| 55 | Csp3_neig B | number of C sp3 neighbors of B |
| 56 | boord | bond order |
A and B are two atoms located at either end of the chemical bond.
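The topological descriptors (IDs 30–43) count the neighbors of each bond end by element, excluding the bond partner. A minimal sketch of that counting rule on a toy adjacency-list molecule representation; this is an illustration only, not the authors' descriptor code:

```python
from collections import Counter

HALOGENS = {"F", "Cl", "Br", "I"}

def neighbor_counts(adjacency, atom, exclude):
    """Count neighbors of `atom` by element, excluding the bond partner `exclude`
    (the Hn/Cn/Nn/On/Pn/Sn/Xn descriptors in the table above)."""
    counts = Counter()
    for idx, element in adjacency[atom]:
        if idx == exclude:
            continue
        # Halogens are pooled into a single Xn count.
        key = "X" if element in HALOGENS else element
        counts[key] += 1
    return {e: counts.get(e, 0) for e in ("H", "C", "N", "O", "P", "S", "X")}

# Toy fragment: atom 0 is a CH2 carbon bonded to a nitrogen (atom 1) and two hydrogens.
adjacency = {
    0: [(1, "N"), (2, "H"), (3, "H")],
    1: [(0, "C"), (4, "C")],
}
# Descriptors for atom A = 0 on the C(0)-N(1) bond: two H neighbors, nothing else.
print(neighbor_counts(adjacency, atom=0, exclude=1))
```

The physicochemical descriptors (charges, polarizabilities, orbital electronegativities) require a chemistry toolkit and are not sketched here.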
Overview of FS algorithms and classifier algorithms.
| Name | Full Name | Description |
|---|---|---|
| CHI | ChiSquared | Evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class. |
| GR | GainRatio | Evaluates the worth of an attribute by measuring the gain ratio with respect to the class. |
| IG | InfoGain | Evaluates the worth of an attribute by measuring the information gain with respect to the class. |
| Relief | Relief | Evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and different class. |
| AdaBoostM1 | AdaBoostM1 with C4.5 as its base-level classifier | The purpose of AdaBoostM1 is to find a highly accurate classification rule by combining many weak classifiers, each of which may be only moderately accurate. |
| KStar | KStar | An instance-based classifier in which the class of a test instance is based on the classes of the training instances similar to it, as determined by some similarity function. |
| BN | Bayes | This classifier learns, from the training data, the conditional probability of each attribute Ai given the class label C. Classification is then done by applying Bayes' rule to compute the probability of C given the particular instance of A1, …, An and predicting the class with the highest posterior probability. |
| IBK | K-nearest neighbours | The IBK classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. The classification rule is to assign a test sample to the majority category label of its k nearest training samples. |
| J48 | C4.5 decision tree | This classifier is generated from the instances and focuses on deducing classification rules, represented as a decision tree, from a group of unordered samples. |
| RF | RandomForest | Random forest classifiers work by growing a predetermined number of decision trees simultaneously. A test instance is run on all the decision trees grown in the forest. Each tree’s classification of the test instance is recorded as a vote. The votes from all trees are aggregated and the test instance is assigned to the class that receives the maximum vote. |
| SVM | Support vector machines | Input vectors are non-linearly mapped to a very high-dimensional feature space, where a linear decision surface is constructed. The goal of the SVM algorithm, based on statistical learning theory, is to find an optimal hyperplane that separates the training samples by a maximal margin, with all positive samples lying on one side and all negative samples on the other. In this work, LibSVM was adopted. |
| Bagging | Bagging with KNN as its base-level classifier | Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. |
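To make the InfoGain and GainRatio criteria above concrete, here is a minimal pure-Python sketch of how a single discrete attribute is scored against the class labels. This illustrates the definitions only; the paper's actual rankings were produced with the WEKA implementations:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H (in bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attribute, labels):
    """IG = H(class) minus the attribute-value-weighted conditional entropy."""
    n = len(labels)
    cond = 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def gain_ratio(attribute, labels):
    """GR = IG / H(attribute): normalizes IG by the attribute's own entropy,
    penalizing attributes with many distinct values."""
    split_info = entropy(attribute)
    return info_gain(attribute, labels) / split_info if split_info else 0.0

# A perfectly informative binary attribute yields IG = 1 bit and GR = 1.
attr = [0, 0, 1, 1]
cls = ["neg", "neg", "pos", "pos"]
print(info_gain(attr, cls), gain_ratio(attr, cls))  # 1.0 1.0
```

Ranking all 56 descriptors by such a score and retaining the top-k subsets is what produces the feature-count sweeps shown in Figure 1.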
Figure 4. The workflow of optimal model selection.