Li Zhang, Haixin Ai, Wen Chen, Zimo Yin, Huan Hu, Junfeng Zhu, Jian Zhao, Qi Zhao, Hongsheng Liu.
Abstract
Carcinogenicity is a highly toxic end point of certain chemicals and has become an important issue in the drug development process. In this study, three novel ensemble classification models, namely Ensemble SVM, Ensemble RF, and Ensemble XGBoost, were developed to predict the carcinogenicity of chemicals. The models combine seven types of molecular fingerprints with three machine learning methods and were built on a dataset of 1003 diverse compounds with rat carcinogenicity data. Of the three models, Ensemble XGBoost performed best, giving an average accuracy of 70.1 ± 2.9%, sensitivity of 67.0 ± 5.0%, and specificity of 73.1 ± 4.4% in five-fold cross-validation, and an accuracy of 70.0%, sensitivity of 65.2%, and specificity of 76.5% in external validation. Compared with some recent methods, the ensemble models outperform some machine learning-based approaches and yield equal accuracy and higher specificity, but lower sensitivity, than rule-based expert systems. The results also suggest that the ensemble models could be further improved if more data were available. As an application, the ensemble models were used to identify potential carcinogens in the DrugBank database. The results indicate that the proposed models are helpful in predicting the carcinogenicity of chemicals. A web server called CarcinoPred-EL has been built for these models ( http://ccsipb.lnu.edu.cn/toxicity/CarcinoPred-EL/ ).
Year: 2017 PMID: 28522849 PMCID: PMC5437031 DOI: 10.1038/s41598-017-02365-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Chemical space of the training set. The chemical space is defined by the molecular weight (MW) on the X-axis and the logarithm of the octanol/water partition coefficient (ALogP) on the Y-axis. Carcinogens and non-carcinogens are represented by red and green dots, respectively.
Figure 2. Box plot representing the molecular descriptors for carcinogens and non-carcinogens. Carcinogens and non-carcinogens are represented by red and green boxes, respectively.
Performance of the basic classifiers in five-fold cross-validation. The performance values are represented as means and standard deviation.
| Algorithms | Fingerprints | Q (%) | SE (%) | SP (%) | AUC (%) |
|---|---|---|---|---|---|
| SVM | CDK | 67.5 ± 2.9 | 63.5 ± 4.9 | 71.5 ± 4.9 | 73.8 ± 3.0 |
| SVM | CDKExt | 67.9 ± 2.9 | 62.9 ± 5.1 | 72.7 ± 4.9 | 73.7 ± 3.2 |
| SVM | CDKGraph | 65.0 ± 3.1 | 61.5 ± 5.2 | 68.4 ± 5.0 | 69.4 ± 3.4 |
| SVM | Estate | 63.0 ± 2.9 | 57.8 ± 5.3 | 68.0 ± 5.0 | 68.3 ± 3.2 |
| SVM | MACCS | 67.1 ± 3.1 | 63.6 ± 5.1 | 70.6 ± 4.9 | 72.0 ± 3.3 |
| SVM | Pubchem | 68.1 ± 3.0 | 64.7 ± 4.9 | 71.5 ± 4.5 | 72.8 ± 3.2 |
| SVM | FP4 | 64.6 ± 3.0 | 63.7 ± 4.9 | 65.4 ± 5.0 | 68.9 ± 3.1 |
| SVM | FP4C | 62.2 ± 3.2 | 62.6 ± 5.0 | 61.8 ± 5.1 | 65.5 ± 3.5 |
| SVM | KR | 66.5 ± 2.9 | 65.7 ± 4.9 | 67.2 ± 4.8 | 71.9 ± 3.1 |
| SVM | KRC | 66.7 ± 3.0 | 67.5 ± 4.9 | 66.0 ± 5.1 | 72.1 ± 3.2 |
| SVM | AP2D | 63.5 ± 3.0 | 56.3 ± 5.3 | 70.5 ± 5.4 | 68.3 ± 3.4 |
| SVM | AP2DC | 63.4 ± 3.0 | 57.0 ± 6.1 | 69.7 ± 5.9 | 68.9 ± 3.2 |
| RF | CDK | 68.3 ± 3.0 | 64.5 ± 5.1 | 72.1 ± 4.5 | 74.1 ± 3.1 |
| RF | CDKExt | 68.4 ± 2.9 | 63.9 ± 4.8 | 72.8 ± 4.4 | 74.3 ± 3.1 |
| RF | CDKGraph | 66.6 ± 2.8 | 64.0 ± 4.7 | 69.0 ± 4.4 | 71.3 ± 3.1 |
| RF | Estate | 64.2 ± 3.0 | 61.6 ± 4.8 | 66.7 ± 4.9 | 69.9 ± 3.2 |
| RF | MACCS | 67.4 ± 2.9 | 63.4 ± 4.6 | 71.3 ± 4.4 | 73.1 ± 2.9 |
| RF | Pubchem | 68.0 ± 3.0 | 65.7 ± 4.9 | 70.3 ± 4.6 | 74.2 ± 3.1 |
| RF | FP4 | 62.1 ± 3.0 | 65.3 ± 4.8 | 59.1 ± 5.0 | 66.8 ± 3.4 |
| RF | FP4C | 63.6 ± 3.2 | 63.9 ± 5.0 | 63.3 ± 4.9 | 67.9 ± 3.5 |
| RF | KR | 67.0 ± 2.9 | 66.5 ± 4.8 | 67.6 ± 4.9 | 73.3 ± 3.0 |
| RF | KRC | 66.5 ± 2.9 | 68.0 ± 4.5 | 65.1 ± 4.6 | 73.0 ± 3.0 |
| RF | AP2D | 64.1 ± 2.9 | 56.5 ± 5.1 | 71.5 ± 4.7 | 68.2 ± 3.2 |
| RF | AP2DC | 64.7 ± 3.0 | 59.6 ± 5.4 | 69.7 ± 4.9 | 70.9 ± 3.3 |
| XGBoost | CDK | 67.0 ± 3.0 | 65.9 ± 5.1 | 68.2 ± 4.9 | 73.6 ± 3.0 |
| XGBoost | CDKExt | 68.3 ± 2.9 | 66.0 ± 4.5 | 70.6 ± 4.4 | 74.5 ± 2.9 |
| XGBoost | CDKGraph | 65.1 ± 3.1 | 64.7 ± 4.6 | 65.5 ± 4.8 | 70.8 ± 3.2 |
| XGBoost | Estate | 63.0 ± 2.9 | 60.9 ± 4.8 | 65.0 ± 4.8 | 69.5 ± 3.0 |
| XGBoost | MACCS | 67.2 ± 2.9 | 65.5 ± 4.9 | 68.8 ± 4.7 | 73.2 ± 2.9 |
| XGBoost | Pubchem | 67.8 ± 3.1 | 66.7 ± 5.2 | 68.8 ± 4.8 | 73.8 ± 3.2 |
| XGBoost | FP4 | 62.5 ± 2.7 | 66.1 ± 4.6 | 59.0 ± 4.4 | 65.9 ± 3.1 |
| XGBoost | FP4C | 61.1 ± 3.2 | 61.3 ± 4.9 | 60.8 ± 5.1 | 65.2 ± 3.3 |
| XGBoost | KR | 66.0 ± 3.0 | 66.8 ± 4.8 | 65.2 ± 4.9 | 72.7 ± 3.0 |
| XGBoost | KRC | 66.5 ± 3.1 | 66.2 ± 4.8 | 66.8 ± 4.7 | 73.0 ± 3.1 |
| XGBoost | AP2D | 64.4 ± 3.0 | 59.0 ± 5.0 | 69.5 ± 4.7 | 70.0 ± 3.3 |
| XGBoost | AP2DC | 64.4 ± 3.2 | 60.9 ± 5.2 | 67.7 ± 4.8 | 70.9 ± 3.3 |
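The table reports each metric as a mean and standard deviation over five-fold cross-validation. The following is a minimal sketch of how such a summary could be produced, not the authors' procedure: the stratified split, the helper names `stratified_folds` and `mean_sd`, the toy labels, and the per-fold accuracies are all assumptions for illustration. This excerpt does not say whether the reported deviations come from the five folds of one run or from repeated runs.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class balance.

    Hypothetical helper: indices of each class are shuffled and dealt
    round-robin, so every fold receives a similar share of each class.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def mean_sd(values):
    """Return the mean and sample standard deviation of per-fold scores."""
    return statistics.mean(values), statistics.stdev(values)


# Toy labels (1 = carcinogen, 0 = non-carcinogen); not the study's data.
labels = [1] * 50 + [0] * 50
folds = stratified_folds(labels, k=5)

# Hypothetical per-fold accuracies, one value per held-out fold.
fold_accuracies = [0.68, 0.72, 0.70, 0.66, 0.74]
mean_acc, sd_acc = mean_sd(fold_accuracies)
print(f"Q = {100 * mean_acc:.1f} ± {100 * sd_acc:.1f} %")
```

In practice each fold would serve once as the held-out set while a classifier is trained on the remaining four, and the per-fold scores would replace the hypothetical list above.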
Performance of ensemble models in five-fold cross-validation. The performance values are represented as means and standard deviation.
| Models | Fingerprints | Q (%) | SE (%) | SP (%) | AUC (%) |
|---|---|---|---|---|---|
| Ensemble SVM | Top 7 | 69.4 ± 2.9 | 65.2 ± 5.2 | 73.5 ± 4.6 | 75.6 ± 3.0 |
| Ensemble RF | Top 7 | 69.2 ± 2.9 | 67.0 ± 5.1 | 71.3 ± 4.6 | 75.7 ± 2.9 |
| Ensemble XGBoost | Top 7 | 70.1 ± 2.9 | 67.0 ± 5.0 | 73.1 ± 4.4 | 76.5 ± 2.9 |
| Ensemble SVM 2 | All 12 | 69.1 ± 3.0 | 64.3 ± 5.3 | 73.7 ± 4.7 | 76.0 ± 3.1 |
| Ensemble RF 2 | All 12 | 68.6 ± 2.9 | 65.5 ± 4.9 | 71.6 ± 4.6 | 75.5 ± 3.0 |
| Ensemble XGBoost 2 | All 12 | 69.8 ± 3.0 | 65.8 ± 5.0 | 73.7 ± 4.5 | 76.6 ± 3.0 |
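Each ensemble model combines base classifiers built with one algorithm on several fingerprints (the top 7 or all 12). This excerpt does not state the combination rule, so the sketch below assumes the simplest option, an unweighted average of the base classifiers' predicted probabilities followed by a 0.5 threshold. The function names and the probability values are hypothetical.

```python
from statistics import mean


def ensemble_probability(base_probabilities):
    """Average the carcinogenicity probabilities of the base classifiers.

    Assumption: unweighted averaging; the source excerpt does not
    specify how the base models are actually combined.
    """
    return mean(base_probabilities)


def classify(probability, threshold=0.5):
    """Turn an ensemble probability into a class label (assumed threshold)."""
    return "carcinogen" if probability >= threshold else "non-carcinogen"


# Hypothetical outputs of seven base classifiers (one per fingerprint)
# for a single compound; these are illustrative numbers only.
base = [0.62, 0.71, 0.55, 0.48, 0.66, 0.59, 0.73]
p = ensemble_probability(base)
label = classify(p)
print(f"ensemble probability = {p:.2f}, predicted class = {label}")
```

Averaging tends to damp the disagreement between fingerprints: here one base model falls below 0.5, yet the ensemble still predicts a carcinogen.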
Performance of ensemble models and some existing software in the external validation dataset.
| Models | Type | Q (%) | SE (%) | SP (%) | AUC (%) |
|---|---|---|---|---|---|
| Ensemble SVM | machine learning | 67.5 | 60.9 | 76.5 | 81.8 |
| Ensemble RF | machine learning | 65.0 | 56.5 | 76.5 | 80.1 |
| Ensemble XGBoost | machine learning | 70.0 | 65.2 | 76.5 | 80.3 |
| admetSAR | machine learning | 50.0 | 34.8 | 70.6 | 49.6 |
| PreADMET | machine learning | 62.5 | 52.2 | 76.5 | — (a) |
| VEGA CAESAR | machine learning | 70.0 | 65.2 | 76.5 | — (a) |
| VEGA ISS | rule based | 70.0 | 73.9 | 64.7 | — (a) |
| VEGA IRFMN/Antares | rule based | 70.0 | 78.3 | 58.8 | — (a) |
| VEGA IRFMN/ISSCAN-CGX | rule based | 75.0 | 82.6 | 64.7 | — (a) |
| Toxtree | rule based | 70.0 | 78.3 | 58.6 | — (a) |
| lazar | similarity search | 75.0 | 87.0 | 58.8 | — (a) |
(a) The AUC cannot be calculated for this software because its results contain no probability values.
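As a worked check on the Q (accuracy), SE (sensitivity), and SP (specificity) columns, the sketch below recomputes the Ensemble XGBoost row from a confusion matrix. The counts are an assumption: an external set of 23 carcinogens and 17 non-carcinogens is inferred from the reported percentages (for example, 15/23 = 65.2% and 13/17 = 76.5%) and is not stated in this excerpt.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy (Q), sensitivity (SE), and specificity (SP) in percent."""
    se = 100.0 * tp / (tp + fn)          # share of carcinogens found
    sp = 100.0 * tn / (tn + fp)          # share of non-carcinogens found
    q = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return q, se, sp


# Assumed confusion counts for Ensemble XGBoost on the external set
# (inferred from the table, not given in the source excerpt).
q, se, sp = classification_metrics(tp=15, fn=8, tn=13, fp=4)
print(f"Q = {q:.1f} %, SE = {se:.1f} %, SP = {sp:.1f} %")
```

With a set this small, one additional correct or incorrect prediction shifts SE by about 4.3 points and SP by about 5.9 points, which is worth keeping in mind when comparing rows.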
Performance indicators and the evaluation method of some carcinogenicity classification models reported in the literature.
| Model name | Evaluation method | Q (%) | SE (%) | SP (%) | Reference |
|---|---|---|---|---|---|
| MC4PC (a) | 10-fold CV (e) | 66.5 | 61.4 | 70.9 | |
| MDL-QSAR (b) | 10-fold CV | 69.2 | 62.8 | 74.8 | |
| lazar | LOOCV (f) | 66.9 | 59.9 | 73.4 | |
| Naïve Bayesian | 5-fold CV | 68 | 57 | 79 | |
| CP ANN MDL (c) | 5-fold CV (g) | 66 | — | — | |
| CP ANN Dragon (VEGA CAESAR) (c) | 5-fold CV | 62 | — | — | |
| VEGA IRFMN/Antares | 5-fold CV | 66.0 | 83.1 | 48.3 | |
| VEGA IRFMN/ISSCAN-CGX (d) | 5-fold CV | 72.7 | 76.5 | 61.8 | |
(a) The coverage of this model was 96%. (b) The coverage of this model was 97%. (c) This study did not provide the SE and SP of the models. (d) This model was trained using carcinogenesis data from both rats and mice. (e) Ten-fold cross-validation. (f) Leave-one-out cross-validation. (g) Five-fold cross-validation.
Figure 3. Performance on five-fold cross-validation (a) and external validation (b) as a function of the number of compounds in the training set for RF ensemble models. The performance values are represented as means and standard error.
Figure 4. Feature importance results for the top five features from each RF model trained with Estate, MACCS, Pubchem, FP4, KR, and AP2D fingerprints. The MeanDecreaseGini values are represented as means and standard deviation.
Top-ranking substructures, their descriptions, and their number of occurrences in carcinogens and non-carcinogens.
| Fingerprint Key | Description | SMARTS Pattern | Present in Carcinogens | Present in Non-Carcinogens |
|---|---|---|---|---|
| AP2D-14 | N-O at topological distance 1 | [#7]~[#8] | 175 | 67 |
| AP2D-13 | N-N at topological distance 1 | [#7]~[#7] | 160 | 54 |
| Estate-28 | dsN | [ND2H0](=*)-* | 155 | 69 |
| KR-4117 | N=O | N=O | 162 | 59 |
| KR-4301 | NN | NN | 137 | 40 |
| MACCS-52 | NN | [#7]~[#7] | 160 | 54 |
| MACCS-63 | N=O | [#7]=[#8] | 162 | 59 |
| Pubchem-423 | N=O | [#7]=,:[#8] | 163 | 60 |
| Pubchem-515 | N-N-C-C | N-N-C-C | 131 | 43 |
| FP4-88 | Carboxylic acid derivative | [$([#6X3H0][#6]),$([#6X3H])](=[!#6])[!#6] | 136 | 234 |
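One way to read the occurrence counts above is as the share of compounds containing a substructure that are carcinogens. The sketch below computes that share for a few rows of the table. It is an illustrative calculation, not an analysis from the paper, and it does not correct for the overall class balance of the training set, which is not given in this excerpt. The name `carcinogen_share` is hypothetical.

```python
# (present in carcinogens, present in non-carcinogens), copied from the table.
occurrences = {
    "AP2D-14": (175, 67),
    "AP2D-13": (160, 54),
    "KR-4301": (137, 40),
    "Pubchem-515": (131, 43),
    "FP4-88": (136, 234),
}


def carcinogen_share(in_carcinogens, in_non_carcinogens):
    """Fraction of substructure-containing compounds that are carcinogens."""
    return in_carcinogens / (in_carcinogens + in_non_carcinogens)


shares = {key: carcinogen_share(*counts) for key, counts in occurrences.items()}

# List the keys from most to least carcinogen-associated.
for key, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{key:12s} {100 * share:5.1f} % of occurrences are in carcinogens")
```

On these counts the nitrogen-containing keys (N-O, N-N, N-N-C-C) occur mostly in carcinogens, whereas FP4-88 (carboxylic acid derivative) occurs mostly in non-carcinogens.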
Predicted carcinogenic drugs with predicted probabilities >0.8.
| DrugBank ID | Name | Probability (SVM) | Probability (RF) | Probability (XGBoost) | Remarks |
|---|---|---|---|---|---|
| DB00262 | Carmustine | 0.8 | 0.87 | 0.96 | IARC Group 2A |
| DB09158 | Trypan blue | 0.78 | 0.91 | 0.95 | IARC Group 2B |
| DB00614 | Furazolidone | 0.73 | 0.85 | 0.94 | IARC Group 3 |
| DB01206 | Lomustine | 0.71 | 0.78 | 0.91 | IARC Group 2A |
| DB04106 | Fotemustine | 0.71 | 0.74 | 0.89 | Mutagen to Salmonella |
| DB01260 | Desonide | 0.73 | 0.83 | 0.87 | Corticosteroid |
| DB03035 | 1,8-Dihydroxy-4-Nitroanthraquinone | 0.72 | 0.73 | 0.85 | — |
| DB00288 | Amcinonide | 0.69 | 0.73 | 0.84 | Corticosteroid |
| DB02636 | 9-hydroxy aristolochic acid | 0.65 | 0.69 | 0.83 | Derivative of aristolochic acid (IARC Group 1) |
| DB07983 | Iodoindomethacin | 0.64 | 0.7 | 0.82 | — |
| DB00591 | Fluocinolone Acetonide | 0.73 | 0.82 | 0.81 | Corticosteroid |
| DB00180 | Flunisolide | 0.73 | 0.81 | 0.81 | Corticosteroid |
| DB01047 | Fluocinonide | 0.69 | 0.73 | 0.81 | Corticosteroid |
| DB08594 | tert-butyl N-[cyano(methyl)amino]carbamate | 0.69 | 0.64 | 0.8 | — |
| DB01976 | 1-Aminoanthracene | 0.71 | 0.8 | 0.73 | Mutagen to Salmonella; genotoxic to Drosophila |
Summary of the 12 types of molecular fingerprints.
| Fingerprint Type | Abbreviation | Pattern Type | Size (bits) | Selected (bits) |
|---|---|---|---|---|
| CDK | CDK | Hash fingerprints | 1024 | 931 |
| CDK Extended | CDKExt | Hash fingerprints | 1024 | 942 |
| CDK Graph | CDKGraph | Hash fingerprints | 1024 | 233 |
| Estate | Estate | Structural features | 79 | 19 |
| MACCS | MACCS | Structural features | 166 | 84 |
| Pubchem | Pubchem | Structural features | 881 | 106 |
| Substructure | FP4 | Structural features | 307 | 31 |
| Substructure Count | FP4C | Structural features count | 307 | 27 |
| Klekota-Roth | KR | Structural features | 4860 | 97 |
| Klekota-Roth Count | KRC | Structural features count | 4860 | 59 |
| 2D Atom Pairs | AP2D | Structural features | 780 | 47 |
| 2D Atom Pairs Count | AP2DC | Structural features count | 780 | 25 |
Figure 5. Flowchart of the ensemble model building process.