Olga Tarasova, Anastassia Rudik, Alexander Dmitriev, Alexey Lagunin, Dmitry Filimonov, Vladimir Poroikov.
Abstract
Metabolism of xenobiotics (Greek xenos: exogenous substances) plays an essential role in the prediction of biological activity and in testing for the subsequent research and development of new drug candidates. Integration of various methods and techniques using different computational and experimental approaches is one of the keys to successful metabolism prediction. While multiple structure-based and ligand-based approaches to metabolism prediction exist, the most important problem arises at the first stage of metabolism prediction: detection of the sites of metabolism (SOMs). In this paper, we describe the application of Quantitative Neighborhoods of Atoms (QNA) descriptors for prediction of the SOMs using the potential function method, as well as several different machine learning techniques: naïve Bayes, random forest classifier, multilayer perceptron with backpropagation, convolutional neural networks, and deep neural networks.
Keywords: QNA; SOM; computational prediction; cytochromes; quantitative neighborhoods of atoms; sites of metabolism
Year: 2017 PMID: 29194399 PMCID: PMC6149875 DOI: 10.3390/molecules22122123
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Sensitivity (Se) and specificity (Sp) of the classification models for SOM predictions on the datasets compiled according to the type of metabolizing reaction: (a) datasets before balancing; and (b) balanced datasets. X-axis: Roman numerals correspond to the following reaction types: I: aliphatic hydroxylation; II: aromatic hydroxylation; III: C-oxidation; IV: N-dealkylation; V: N-oxidation; VI: O-dealkylation; and VII: S-oxidation. Machine learning algorithms: RF: random forest; RBF: radial basis function network; MLP: multilayer perceptron; Conv. NN: convolutional neural network; AUC: area under the receiver operating characteristic curve. Y-axis: Se = TP/(TP + FN); Sp = TN/(TN + FP). TP: number of true positives; FP: number of false positives; TN: number of true negatives; FN: number of false negatives.
Figure 2. Sensitivity (Se) and specificity (Sp) of the classification models for SOM predictions on the datasets compiled according to both the type of metabolizing reaction and the enzyme: (a) datasets before balancing; and (b) balanced datasets. X-axis: Roman numerals correspond to the following reaction types: I-CYP1A2: CYP1A2-aliphatic hydroxylation; III-CYP1A2: CYP1A2-C-oxidation; III-CYP3A4: CYP3A4-C-oxidation; VI-CYP2C19: CYP2C19-O-dealkylation; VII-CYP2C9: CYP2C9-S-oxidation. Machine learning algorithms: RF: random forest; RBF: radial basis function network; MLP: multilayer perceptron; Conv. NN: convolutional neural network; AUC: area under the receiver operating characteristic curve. Y-axis: Se = TP/(TP + FN); Sp = TN/(TN + FP). TP: number of true positives; FP: number of false positives; TN: number of true negatives; FN: number of false negatives.
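The sensitivity and specificity formulas given in the figure captions, together with the balanced accuracy (BA) reported in the tables, can be computed from the four confusion counts. The following is a minimal sketch: the function name and the example counts are hypothetical, and BA is assumed to be the mean of sensitivity and specificity, which is the usual definition but is not spelled out in this record.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, balanced accuracy) from confusion counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / (tp + fn)  # share of labeled SOMs that were predicted as SOMs
    specificity = tn / (tn + fp)  # share of non-SOM atoms that were predicted as non-SOMs
    # Assumption: BA is the arithmetic mean of sensitivity and specificity.
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy


# Hypothetical counts chosen only for illustration (not taken from the paper).
se, sp, ba = classification_metrics(tp=40, fp=30, tn=270, fn=10)
print(f"sensitivity={se:.2f} specificity={sp:.2f} BA={ba:.2f}")
```

Because non-SOM atoms usually far outnumber SOM atoms, BA is less flattering than plain accuracy on unbalanced data, which is one reason to report it alongside AUC.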
Average values of performances obtained for the (a) initial datasets before balancing and (b) balanced modeling sets using multiple machine learning classifiers.
| Method | AUC 1 | BA 1 | AUC 2 | BA 2 |
|---|---|---|---|---|
| (a) Initial datasets before balancing | | | | |
| Naïve Bayes | 0.90 (0.15) | 0.76 (0.17) | 0.85 (0.02) | 0.48 (0.04) |
| RF | 0.90 (0.05) | 0.75 (0.16) | 0.92 (0.02) | 0.78 (0.19) |
| RBF | 0.87 (0.09) | 0.69 (0.15) | 0.87 (0.04) | 0.51 (0.05) |
| MLP | 0.86 (0.12) | 0.63 (0.16) | 0.87 (0.09) | 0.51 (0.18) |
| Conv. NN | 0.87 (0.15) | 0.67 (0.17) | 0.86 (0.08) | 0.65 (0.09) |
| (b) Balanced datasets | | | | |
| Naïve Bayes | 0.96 (0.007) | 0.83 (0.07) | 0.92 (0.008) | 0.78 (0.02) |
| RF | 0.99 (0.005) | 0.97 (0.01) | 0.99 (0.005) | 0.97 (0.008) |
| RBF | 0.95 (0.007) | 0.88 (0.06) | 0.97 (0.04) | 0.91 (0.016) |
| MLP | 0.97 (0.009) | 0.85 (0.03) | 0.92 (0.19) | 0.87 (0.016) |
| Conv. NN | 0.965 (0.007) | 0.84 (0.05) | 0.94 (0.08) | 0.92 (0.13) |
1 The performances obtained for the datasets compiled according to the type of metabolizing reaction; 2 The performances obtained for the datasets compiled according to both the type of metabolizing reaction and the enzyme. RF: random forest; RBF: radial basis function network; MLP: multilayer perceptron; Conv. NN: convolutional neural network; AUC: area under the receiver operating characteristic curve; BA: balanced accuracy.
Comparison of the performances of the QNA-based approach and the results reported in Tyzack et al., showing the average and the best AUC values obtained based on the leave-one-out cross-validation.
| Isoform of Cytochrome | AUCav 1 | SDAUC 2 | AUCmax 3 | AUC[ref] 4 |
|---|---|---|---|---|
| CYP2C9 | 0.91 | 0.07 | 0.99 | 0.97 |
| CYP2D6 | 0.90 | 0.09 | 0.99 | 0.97 |
| CYP3A4 | 0.91 | 0.07 | 0.97 | 0.96 |
1 Average AUC values obtained with the methods implemented in Weka, using five-fold random division into training and test sets (plus validation sets for MLP and Conv. NN only); 2 Standard deviation of the AUC values; 3 The highest AUC value; 4 Previously published AUC values for the methods implemented in Tyzack et al.
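To make the AUCav, SDAUC, and AUCmax columns concrete, the sketch below computes one AUC from labels and scores and then summarizes a list of AUC values. It is an illustration only: the function names and the toy data are hypothetical, the AUC is computed by direct pairwise comparison rather than by Weka, and the use of the sample standard deviation is an assumption, since the record does not say which form was used.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize_auc(auc_values):
    """Return (average, standard deviation, maximum) of a list of AUC values."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return mean(auc_values), stdev(auc_values), max(auc_values)


# Toy data for illustration only: 1 marks a labeled SOM atom, 0 any other atom.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(roc_auc(labels, scores))
print(summarize_auc([0.8, 0.9, 1.0]))
```

The pairwise form is quadratic in the number of atoms, which is acceptable for a sketch; a rank-based computation would be preferable for large datasets.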
Comparison of the performances of the QNA-based approach and the results reported in Rudik et al., 2014. The average and best AUC values in our method were obtained by means of random division into the training and test sets in a proportion of 2:1.
| Isoform of Cytochrome | AUCav 1 | SDAUC 2 | AUCmax 3 | AUC[ref] 4 |
|---|---|---|---|---|
| CYP1A2 | 0.86 | 0.12 | 0.98 | 0.83 |
| CYP2C9 | 0.85 | 0.08 | 0.94 | 0.87 |
| CYP2C19 | 0.82 | 0.09 | 0.92 | 0.76 |
| CYP2D6 | 0.87 | 0.09 | 0.94 | 0.83 |
| CYP3A4 | 0.85 | 0.09 | 0.95 | 0.85 |
1 Average AUC values obtained with the methods implemented in Weka; 2 Standard deviation of the AUC values; 3 The highest AUC value; 4 Previously published AUC values for the methods implemented in Rudik et al.
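A random division into training and test sets in a 2:1 proportion, as mentioned in the table caption, might look like the sketch below. The function name, the seed handling, and the choice to split whole records are assumptions made for illustration; the record does not describe how the division was actually implemented.

```python
import random


def split_two_to_one(records, seed):
    """Shuffle records and return (training, test) with roughly a 2:1 size ratio."""
    rng = random.Random(seed)  # a local generator keeps each split reproducible
    shuffled = list(records)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Hypothetical record identifiers; repeating with different seeds gives the
# set of splits over which an average and a best AUC could be reported.
records = list(range(9))
for seed in range(3):
    train, test = split_two_to_one(records, seed)
    print(seed, len(train), len(test))
```

Splitting by whole molecule rather than by individual atom avoids placing atoms of the same structure in both sets, which would otherwise inflate the test AUC.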
The number of chemical structures in the modeling sets used in the study.
| Isoform of Cytochrome | I | II | III | IV | V | VI | VII | N | N1 |
|---|---|---|---|---|---|---|---|---|---|
| CYP1A2 | 201 | 31 | 179 | 55 | 108 | 44 | 165 | 463 | 803 |
| CYP2C9 | 165 | 167 | 25 | 125 | 19 | 93 | 35 | 268 | 643 |
| CYP2C19 | 172 | 122 | 13 | 138 | 23 | 87 | 49 | 369 | 607 |
| CYP2D6 | 167 | 200 | 13 | 215 | 34 | 149 | 36 | 466 | 846 |
| CYP3A4 | 474 | 269 | 179 | 419 | 95 | 207 | 112 | 570 | 1004 |
The designations of columns are: I: aliphatic hydroxylation; II: aromatic hydroxylation; III: C-oxidation; IV: N-dealkylation; V: N-oxidation; VI: O-dealkylation; VII: S-oxidation; N: number of unique substrates; and N1: number of records with the labeled SOM.
Figure 3. An example of the generation of QNA descriptors for the molecule 9-methylacridine.