Ismail Babajide Mustapha, Faisal Saeed.
Abstract
Following the explosive growth in chemical and biological data, the shift from traditional methods of drug discovery to computer-aided means has made data mining and machine learning methods integral parts of today's drug discovery process. In this paper, extreme gradient boosting (Xgboost), which is an ensemble of Classification and Regression Trees (CART) and a variant of the Gradient Boosting Machine, was investigated for the prediction of biological activity based on a quantitative description of the compound's molecular structure. Seven datasets well known in the literature were used, and the experimental results show that Xgboost can outperform machine learning algorithms such as Random Forest (RF), Support Vector Machines (LSVM), Radial Basis Function Neural Network (RBFN) and Naïve Bayes (NB) in predicting biological activity. In addition to its ability to detect minority activity classes in highly imbalanced datasets, it showed remarkable performance on both high and low diversity datasets.
Keywords: biological data; drug discovery; prediction of biological activity; virtual screening
Year: 2016 PMID: 27483216 PMCID: PMC6273295 DOI: 10.3390/molecules21080983
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
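The abstract describes Xgboost as a tree ensemble built by gradient boosting. The following is a minimal, standard-library sketch of that idea, assuming depth-1 trees (stumps), logistic loss, and the second-order leaf weight w = -G / (H + lambda) used in the XGBoost formulation. It is a toy illustration only: it is not the authors' experimental setup and not the `xgboost` library, and all function and parameter names (`best_stump`, `fit`, `predict_proba`, `eta`, `lam`) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def best_stump(X, g, h, lam):
    """Return the depth-1 split with the highest second-order gain.

    g and h are the per-sample first and second derivatives of the loss.
    Result: (gain, feature, threshold, w_left, w_right), or None if no
    feature has two distinct values to split on.
    """
    G, H = sum(g), sum(h)
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for a, b in zip(values, values[1:]):
            thr = (a + b) / 2.0
            GL = sum(gi for row, gi in zip(X, g) if row[f] <= thr)
            HL = sum(hi for row, hi in zip(X, h) if row[f] <= thr)
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL ** 2 / (HL + lam)
                          + GR ** 2 / (HR + lam)
                          - G ** 2 / (H + lam))
            if best is None or gain > best[0]:
                # Regularised leaf weights: w = -G / (H + lambda)
                best = (gain, f, thr, -GL / (HL + lam), -GR / (HR + lam))
    return best


def fit(X, y, rounds=20, eta=0.3, lam=1.0):
    """Boost stumps on logistic loss; returns a list of scaled stumps."""
    stumps = []
    margin = [0.0] * len(y)
    for _ in range(rounds):
        p = [sigmoid(m) for m in margin]
        g = [pi - yi for pi, yi in zip(p, y)]   # gradient of log loss
        h = [pi * (1.0 - pi) for pi in p]       # hessian of log loss
        stump = best_stump(X, g, h, lam)
        if stump is None:
            break
        _, f, thr, wl, wr = stump
        stumps.append((f, thr, eta * wl, eta * wr))
        margin = [m + (eta * wl if row[f] <= thr else eta * wr)
                  for m, row in zip(margin, X)]
    return stumps


def predict_proba(stumps, row):
    """Probability of the positive (active) class for one sample."""
    return sigmoid(sum(wl if row[f] <= thr else wr
                       for f, thr, wl, wr in stumps))


# Toy data: one descriptor, inactive (0) for small values, active (1) for large.
toy_X = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0, 0, 1, 1]
toy_model = fit(toy_X, toy_y)
print(predict_proba(toy_model, [0.0]), predict_proba(toy_model, [3.0]))
```

The sketch omits everything that makes the real library practical (deeper trees, the per-leaf complexity penalty, column and row subsampling, sparse handling), so it should be read only as an outline of the boosting loop.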
Activity Classes for the cyclooxygenase-2 (COX2), estrogen receptor (ER) and benzodiazepine receptor (BZR) Datasets.
| Dataset | Active (Training) | Active (Validation) | Inactive (Training) | Inactive (Validation) | Mean Pairwise Similarity (Active) | Mean Pairwise Similarity (Inactive) |
|---|---|---|---|---|---|---|
| Cyclooxygenase-2 inhibitors | 211 | 92 | 116 | 48 | 0.687 | 0.690 |
| Benzodiazepine receptor | 214 | 92 | 70 | 29 | 0.536 | 0.538 |
| Estrogen receptor | 86 | 55 | 190 | 62 | 0.468 | 0.456 |
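The mean pairwise similarity columns summarise how structurally diverse each activity class is. The source excerpt does not say which similarity coefficient or fingerprint was used, so the sketch below assumes the Tanimoto coefficient on binary fingerprints represented as sets of on-bit indices; the function names are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto similarity over all unordered pairs of molecules."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Three made-up fingerprints (not taken from any dataset in the source).
toy_fps = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
print(round(mean_pairwise_similarity(toy_fps), 4))
```

Under this reading, a higher mean (such as the COX2 values near 0.69) indicates a more homogeneous, lower-diversity class than a lower mean (such as the ER values near 0.46).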
Number of Active (Na) compounds for 12 Directory of Useful Decoys (DUD) datasets.
| No | Activity Class | Na (Training) | Na (Validation) |
|---|---|---|---|
| 1 | FGFR1T | 90 | 30 |
| 2 | FXA | 106 | 40 |
| 3 | GART | 27 | 13 |
| 4 | GBP | 38 | 14 |
| 5 | GR | 55 | 23 |
| 6 | HIVPR | 42 | 20 |
| 7 | HIVRT | 32 | 11 |
| 8 | HMGA | 24 | 11 |
| 9 | HSP90 | 24 | 13 |
| 10 | MR | 10 | 5 |
| 11 | NA | 35 | 14 |
| 12 | PR | 22 | 5 |
Activity Classes for MDDR1.
| Activity Index | Activity Class | Active Molecules (Training) | Active Molecules (Validation) | Pairwise Similarity (Mean) |
|---|---|---|---|---|
| 31420 | renin inhibitors | 783 | 347 | 0.573 |
| 71523 | HIV protease inhibitors | 535 | 215 | 0.446 |
| 37110 | thrombin inhibitors | 561 | 242 | 0.419 |
| 31432 | angiotensin II AT1 antagonists | 674 | 269 | 0.403 |
| 42731 | substance P antagonists | 859 | 387 | 0.339 |
| 06233 | 5HT3 antagonists | 530 | 222 | 0.351 |
| 06245 | 5HT reuptake inhibitors | 257 | 102 | 0.345 |
| 07701 | D2 antagonists | 268 | 127 | 0.345 |
| 06235 | 5HT1A agonists | 589 | 238 | 0.343 |
| 78374 | protein kinase C inhibitors | 326 | 127 | 0.323 |
| 78331 | cyclooxygenase inhibitors | 427 | 209 | 0.268 |
Activity Classes for MDDR2.
| Activity Index | Activity Class | Active Molecules (Training) | Active Molecules (Validation) | Pairwise Similarity (Mean) |
|---|---|---|---|---|
| 07707 | adenosine (A1) agonists | 136 | 71 | 0.424 |
| 07708 | adenosine (A2) agonists | 119 | 37 | 0.484 |
| 31420 | renin inhibitors | 791 | 339 | 0.584 |
| 42710 | monocyclic β-lactams | 78 | 33 | 0.596 |
| 64100 | cephalosporins | 911 | 390 | 0.512 |
| 64200 | carbacephems | 115 | 43 | 0.503 |
| 64220 | carbapenems | 732 | 319 | 0.414 |
| 64300 | penicillin | 88 | 38 | 0.444 |
| 65000 | antibiotic, macrolide | 268 | 120 | 0.673 |
| 75755 | vitamin D analogous | 323 | 132 | 0.569 |
Activity Classes for MDDR3.
| Activity Index | Activity Class | Active Molecules (Training) | Active Molecules (Validation) | Pairwise Similarity (Mean) |
|---|---|---|---|---|
| 09249 | muscarinic (M1) agonists | 620 | 280 | 0.257 |
| 12455 | NMDA receptor antagonists | 990 | 410 | 0.311 |
| 12464 | nitric oxide synthase inhibitors | 348 | 157 | 0.237 |
| 31281 | dopamine β-hydroxylase inhibitors | 76 | 30 | 0.324 |
| 43210 | aldose reductase inhibitors | 663 | 294 | 0.37 |
| 71522 | reverse transcriptase inhibitors | 501 | 199 | 0.311 |
| 75721 | aromatase inhibitors | 444 | 192 | 0.318 |
| 78331 | cyclooxygenase inhibitors | 449 | 187 | 0.382 |
| 78348 | phospholipase A2 inhibitors | 430 | 187 | 0.291 |
| 78351 | lipoxygenase inhibitors | 1478 | 633 | 0.365 |
Figure 1. Experimental Design.
Sensitivity, Specificity, Area under Curve, Accuracy and F-measure on MDDR1 Dataset.
| ML Algorithm | SEN (Training) | SPC (Training) | AUC (Training) | ACC (Training) | F-Sc (Training) | SEN (Validation) | SPC (Validation) | AUC (Validation) | ACC (Validation) | F-Sc (Validation) |
|---|---|---|---|---|---|---|---|---|---|---|
| XGB | 0.9484 | 0.9958 | 0.9721 | 0.9575 | 0.9830 | 0.9579 | 0.9960 | 0.9769 | 0.9594 | 0.9536 |
| RF | 0.9474 | 0.9963 | 0.9718 | 0.9621 | 0.9514 | 0.9502 | 0.9957 | 0.9730 | 0.9590 | 0.9525 |
| LSVM | 0.9258 | 0.9943 | 0.9600 | 0.9425 | 0.9264 | 0.9357 | 0.9948 | 0.9653 | 0.9497 | 0.9371 |
| RBFN | 0.7566 | 0.9773 | 0.8670 | 0.7719 | 0.7451 | 0.7751 | 0.9777 | 0.8764 | 0.7746 | 0.7553 |
| NB | 0.7648 | 0.9781 | 0.8715 | 0.7826 | 0.7578 | 0.7488 | 0.9762 | 0.8625 | 0.7626 | 0.7383 |
Sensitivity, Specificity, Area under Curve, Accuracy and F-measure on MDDR2 Dataset.
| ML Algorithm | SEN (Training) | SPC (Training) | AUC (Training) | ACC (Training) | F-Sc (Training) | SEN (Validation) | SPC (Validation) | AUC (Validation) | ACC (Validation) | F-Sc (Validation) |
|---|---|---|---|---|---|---|---|---|---|---|
| XGB | 0.9779 | 0.9981 | 0.9880 | 0.9834 | 0.9689 | 0.9820 | 0.9983 | 0.9902 | 0.9849 | 0.9673 |
| RF | 0.9562 | 0.9979 | 0.9771 | 0.9837 | 0.9689 | 0.9468 | 0.9977 | 0.9723 | 0.9823 | 0.9597 |
| LSVM | 0.9590 | 0.9978 | 0.9784 | 0.9817 | 0.9667 | 0.9436 | 0.9974 | 0.9705 | 0.9790 | 0.9547 |
| RBFN | 0.9507 | 0.9961 | 0.9734 | 0.9646 | 0.9402 | 0.9420 | 0.9960 | 0.9690 | 0.9658 | 0.9312 |
| NB | 0.9546 | 0.9963 | 0.9755 | 0.9677 | 0.9458 | 0.9401 | 0.9967 | 0.9684 | 0.9724 | 0.9403 |
Sensitivity, Specificity, Area under Curve, Accuracy and F-measure on MDDR3 Dataset.
| ML Algorithm | SEN (Training) | SPC (Training) | AUC (Training) | ACC (Training) | F-Sc (Training) | SEN (Validation) | SPC (Validation) | AUC (Validation) | ACC (Validation) | F-Sc (Validation) |
|---|---|---|---|---|---|---|---|---|---|---|
| XGB | 0.9407 | 0.9937 | 0.9672 | 0.9440 | 0.9348 | 0.9493 | 0.9937 | 0.9715 | 0.9447 | 0.9448 |
| RF | 0.9209 | 0.9929 | 0.9569 | 0.94099 | 0.9350 | 0.9316 | 0.9928 | 0.9622 | 0.9397 | 0.9405 |
| LSVM | 0.8800 | 0.9885 | 0.93425 | 0.904651 | 0.8948 | 0.8983 | 0.9902 | 0.9443 | 0.9171 | 0.9120 |
| RBFN | 0.7053 | 0.9643 | 0.8348 | 0.680613 | 0.6597 | 0.7254 | 0.9657 | 0.8456 | 0.6890 | 0.6710 |
| NB | 0.6803 | 0.9613 | 0.8208 | 0.657276 | 0.6402 | 0.6636 | 0.9594 | 0.8115 | 0.6415 | 0.6211 |
Sensitivity, Specificity, Area under Curve, Accuracy and F-measure on DUD Dataset.
| ML Algorithm | SEN (Training) | SPC (Training) | AUC (Training) | ACC (Training) | F-Sc (Training) | SEN (Validation) | SPC (Validation) | AUC (Validation) | ACC (Validation) | F-Sc (Validation) |
|---|---|---|---|---|---|---|---|---|---|---|
| XGB | 0.8677 | 0.9920 | 0.9298 | 0.9113 | 0.8616 | 0.8569 | 0.9953 | 0.9261 | 0.9471 | 0.8673 |
| RF | 0.8861 | 0.9935 | 0.9397 | 0.9294 | 0.8908 | 0.9078 | 0.9951 | 0.9515 | 0.9471 | 0.9123 |
| LSVM | 0.8659 | 0.9919 | 0.9289 | 0.9113 | 0.8683 | 0.8738 | 0.9941 | 0.9340 | 0.9375 | 0.8862 |
| RBFN | 0.8228 | 0.9895 | 0.9061 | 0.8871 | 0.8344 | 0.8503 | 0.9931 | 0.9217 | 0.9279 | 0.8537 |
| NB | 0.8783 | 0.9910 | 0.9346 | 0.9032 | 0.8730 | 0.9177 | 0.9942 | 0.9559 | 0.9375 | 0.9193 |
Sensitivity, Specificity, Area under Curve, Accuracy and F-measure on COX2 Dataset.
| ML Algorithm | SEN (Training) | SPC (Training) | AUC (Training) | ACC (Training) | F-Sc (Training) | SEN (Validation) | SPC (Validation) | AUC (Validation) | ACC (Validation) | F-Sc (Validation) |
|---|---|---|---|---|---|---|---|---|---|---|
| XGB | 0.9361 | 0.9444 | 0.9403 | 0.9388 | 0.9535 | 0.9570 | 0.9362 | 0.9466 | 0.9500 | 0.9622 |
| RF | 0.9763 | 0.8879 | 0.9321 | 0.9450 | 0.9581 | 0.9783 | 0.8750 | 0.9266 | 0.9429 | 0.9574 |
| LSVM | 0.9526 | 0.9138 | 0.9332 | 0.9388 | 0.9526 | 0.9565 | 0.8958 | 0.9262 | 0.9357 | 0.9514 |
| RBFN | 0.9293 | 0.7203 | 0.8248 | 0.8379 | 0.8658 | 0.9250 | 0.7000 | 0.8125 | 0.8286 | 0.8605 |
| NB | 0.6777 | 0.9569 | 0.8173 | 0.7768 | 0.7967 | 0.7065 | 1.0000 | 0.8533 | 0.8071 | 0.8280 |
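The five reported measures can all be derived from a binary confusion matrix. The sketch below uses the standard definitions of sensitivity, specificity, accuracy and F-measure, and computes the single-threshold AUC as (SEN + SPC) / 2. The counts in the usage example are an assumption: they are inferred from the COX2 validation split (92 actives, 48 inactives) and the rates in the RF validation row above, not stated in the source. The function name is hypothetical.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard metrics from confusion-matrix counts (actives = positives)."""
    sen = tp / (tp + fn)                   # sensitivity (recall on actives)
    spc = tn / (tn + fp)                   # specificity (recall on inactives)
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    f_score = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision, recall
    auc = (sen + spc) / 2                  # AUC of a single operating point
    return {"SEN": sen, "SPC": spc, "AUC": auc, "ACC": acc, "F-Sc": f_score}


# Assumed counts: 90 of 92 actives and 42 of 48 inactives classified correctly.
rf_cox2_validation = binary_metrics(tp=90, fn=2, tn=42, fp=6)
for name, value in rf_cox2_validation.items():
    print(name, round(value, 4))
```

With these assumed counts the function returns 0.9783, 0.8750, 0.9266, 0.9429 and 0.9574, which agrees with the RF validation row of the COX2 table. How the measures were aggregated for the multi-class MDDR and DUD datasets is not described in this excerpt.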
Sensitivity, Specificity, Area under Curve, Accuracy and F-measure on BZR Dataset.
| ML Algorithm | SEN (Training) | SPC (Training) | AUC (Training) | ACC (Training) | F-Sc (Training) | SEN (Validation) | SPC (Validation) | AUC (Validation) | ACC (Validation) | F-Sc (Validation) |
|---|---|---|---|---|---|---|---|---|---|---|
| XGB | 0.9764 | 0.9028 | 0.9396 | 0.9577 | 0.9718 | 0.9884 | 0.8000 | 0.8942 | 0.9339 | 0.9551 |
| RF | 0.9720 | 0.9143 | 0.9431 | 0.9577 | 0.9720 | 0.9674 | 0.8966 | 0.9320 | 0.9504 | 0.9674 |
| LSVM | 0.9579 | 0.8714 | 0.9147 | 0.9366 | 0.9579 | 0.9348 | 1.0000 | 0.9674 | 0.9504 | 0.9663 |
| RBFN | 0.9947 | 0.7263 | 0.8605 | 0.9049 | 0.9330 | 1.0000 | 0.6444 | 0.8222 | 0.8678 | 0.9048 |
| NB | 0.9112 | 0.8571 | 0.8842 | 0.8979 | 0.9308 | 0.8478 | 0.9655 | 0.9067 | 0.8760 | 0.9123 |
Sensitivity, Specificity, Area under Curve, Accuracy and F-measure on ER Dataset.
| ML Algorithm | SEN (Training) | SPC (Training) | AUC (Training) | ACC (Training) | F-Sc (Training) | SEN (Validation) | SPC (Validation) | AUC (Validation) | ACC (Validation) | F-Sc (Validation) |
|---|---|---|---|---|---|---|---|---|---|---|
| XGB | 0.7671 | 0.8522 | 0.8097 | 0.8297 | 0.7044 | 0.8837 | 0.7703 | 0.8270 | 0.8120 | 0.7755 |
| RF | 0.6860 | 0.8895 | 0.7878 | 0.8261 | 0.7108 | 0.6364 | 0.8226 | 0.7295 | 0.7350 | 0.6931 |
| LSVM | 0.6628 | 0.9316 | 0.7972 | 0.8478 | 0.7308 | 0.6727 | 0.9194 | 0.7960 | 0.8034 | 0.7629 |
| RBFN | 0.7089 | 0.8477 | 0.7783 | 0.8080 | 0.6788 | 0.8478 | 0.7746 | 0.8112 | 0.8034 | 0.7723 |
| NB | 0.9767 | 0.6368 | 0.8068 | 0.7428 | 0.7029 | 0.9818 | 0.5645 | 0.7732 | 0.7607 | 0.7941 |
Rankings of Prediction Methods based on Kendall W Test Using Accuracy Measure.
| Measure | W | P | Ranks |
|---|---|---|---|
| Accuracy | 0.65 | 0.001 | XGBOOST > RF > LSVM > RBFN > NB |
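Kendall's coefficient of concordance W measures how consistently the methods are ranked across datasets (0 = no agreement, 1 = identical rankings). The sketch below computes W = 12 S / (m² (n³ - n)) for m datasets and n methods, using average ranks for ties and no tie correction. Feeding it the validation accuracies from the seven result tables above is an assumption: the source does not state which accuracies or which tie handling the authors used. Function names are hypothetical.

```python
def average_ranks(scores):
    """Rank scores in descending order (1 = best); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendall_w(score_table):
    """Return (W, rank_sums) for rows = datasets, columns = methods."""
    m, n = len(score_table), len(score_table[0])
    rank_rows = [average_ranks(row) for row in score_table]
    rank_sums = [sum(row[c] for row in rank_rows) for c in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n)), rank_sums


methods = ["XGB", "RF", "LSVM", "RBFN", "NB"]
# Validation ACC per dataset, columns in the order of `methods`.
validation_acc = [
    [0.9594, 0.9590, 0.9497, 0.7746, 0.7626],  # MDDR1
    [0.9849, 0.9823, 0.9790, 0.9658, 0.9724],  # MDDR2
    [0.9447, 0.9397, 0.9171, 0.6890, 0.6415],  # MDDR3
    [0.9471, 0.9471, 0.9375, 0.9279, 0.9375],  # DUD
    [0.9500, 0.9429, 0.9357, 0.8286, 0.8071],  # COX2
    [0.9339, 0.9504, 0.9504, 0.8678, 0.8760],  # BZR
    [0.8120, 0.7350, 0.8034, 0.8034, 0.7607],  # ER
]
w, rank_sums = kendall_w(validation_acc)
print(round(w, 3), dict(zip(methods, rank_sums)))
```

On these inputs the rank sums order the methods as XGB, RF, LSVM, RBFN, NB, the same order as the reported ranking, and the uncorrected W comes out at about 0.657, close to the reported 0.65. The P value in the table would additionally require a chi-square test on m (n - 1) W, which is not shown here.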