Medard Edmund Mswahili, Gati Lother Martin, Jiyoung Woo, Guang J Choi, Young-Seob Jeong.
Abstract
Malaria remains one of the most threatening and dangerous illnesses, caused by the Plasmodium falciparum parasite. Chloroquine (CQ) and first-line artemisinin-based combination therapy (ACT) have long been the drugs of choice for the treatment and control of malaria; however, CQ-resistant and artemisinin-resistant parasites are now present in most areas where malaria is endemic. In this work, we developed five machine learning models to predict the antimalarial bioactivity of a drug against Plasmodium falciparum from features (i.e., molecular descriptor values) computed by the PaDEL software from the compounds' SMILES strings, and compared the models in experiments on our collected dataset of 4794 instances. We found that three of the five models, namely the artificial neural network (ANN), extreme gradient boosting (XGB), and random forest (RF), outperform the others in terms of accuracy, while all five models achieved comparable performance using roughly a quarter of the most promising descriptors picked by the feature selection algorithm. We also investigated the contribution of all molecular descriptors by comparing the rank values assigned by the feature selection algorithm, and found that the most potent and relevant descriptors come from the 'Autocorrelation' module, while the 'Atom type electrotopological state' descriptors contributed the least to the model.
Keywords: PaDEL; antimalarial drug; drug discovery; feature selection; machine learning; molecular descriptor; plasmodium falciparum
Year: 2021 PMID: 34944394 PMCID: PMC8698534 DOI: 10.3390/biom11121750
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
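The abstract describes a pipeline of molecular descriptors, RFE feature selection, and a classifier. A minimal sketch (not the authors' code) of that wiring with scikit-learn, using synthetic features in place of the 1444 PaDEL descriptors:

```python
# Sketch of the descriptor -> RFE -> classifier pipeline described in the
# abstract. Synthetic data stands in for the PaDEL descriptor matrix; all
# sizes here are illustrative, not the paper's.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Stand-in for 4794 compounds x 1444 descriptors (here 500 x 100).
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    # RFE keeps roughly a quarter of the features, mirroring the paper's
    # observation about the promising-descriptor subset.
    ("rfe", RFE(RandomForestClassifier(n_estimators=50, random_state=0),
                n_features_to_select=25)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print(round(acc, 3))
```

The same pipeline object would accept any of the five classifiers in the `clf` slot, so the per-model comparisons in the tables below reduce to swapping one estimator.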
Figure 1: The development process for antimalarial drug prediction, from data gathering through ML model deployment.
A sample of the unprocessed data.
| Service | ChEMBL_synonyms_PubChem_SID | Canonical_Isomeric_SMILES (Sources: PubChem_ChEMBL_and_EMBL-EBI) | Label |
|---|---|---|---|
| ChEMBL_&_PubChem | CHEMBL219517 | C1CSCN(C1=O)CCCNC2=C3C=CC(=CC3=NC=C2)Cl | 0 |
| | 380797 | CC(C1=CC=CC=C1)NC(=O)C2=CC=CC=C2N=CC3=C(C=CC4=CC=CC=C43)O | 0 |
| | 591362 | C1=CC=C(C(=C1)C(=O)NC2=NC(=CS2)C3=CC=CC=N3)Br | 0 |
| | 465546 | C[C@@]1(CC[C@@H]2[C@]3(CC[C@@H](C([C@@H]3CC[C@]2(C1)O)(C)C)O)C)C=C | 0 |
| | 341638 | CCN(CC)CCCCSC1=C2C=CC(=CC2=NC=C1)Cl | 0 |
| | SID_381881704 | CC1CN(CC(O1)C)C(=O)C2=C(C3=CC=CC=C3S2)OCC4=CC(=C(C=C4)F)F | 1 |
| | 381885288 | CC1CN(CC(O1)C)C(=O)C2=C(C3=CC=CC=C3S2)Cl | 1 |
| | 381885327 | CC1CN(CC(O1)C)C(=O)C2=C(C3=C(S2)C=C(C=C3)F)Cl | 1 |
| | 381886215 | CC1CN(CC(O1)C)C(=O)C2=NC3=CC=CC=C3S2 | 0 |
| | 381886674 | CC1CN(CC(O1)C)C(=O)C2=C(C3=CC=CC=C3S2)OCC4=CC=CC=C4 | 1 |
| | 381886749 | CC1CN(CC(O1)C)C(=O)C2=C(C3=C(S2)C=C(C=C3)C)Cl | 1 |
Data statistics.
| | All Labels | Label ‘Active’ | Label ‘Inactive’ |
|---|---|---|---|
| # of data | 4794 | 2070 | 2724 |
Figure 2: Averaged test set accuracy comparison using feature selection algorithms, against the number of selected features.
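The rank values reported in the feature table further below come from recursive feature elimination. A small sketch (not the authors' code) showing how scikit-learn's RFE assigns those ranks, where rank 1 means "selected" and larger ranks mean "eliminated earlier":

```python
# Demonstration of RFE rank assignment. With step=1, each elimination round
# removes one feature; selected features get rank 1, eliminated features get
# ranks 2, 3, ... in reverse order of removal.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy data: 30 features, keep 10 (sizes are illustrative only).
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=8, random_state=1)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

ranks = sorted(set(int(r) for r in rfe.ranking_))
print(ranks)  # 1 for the 10 kept features, then 2..21 for the 20 removed
```

This is why the best-feature column of the rank table shows a uniform rank of 1 while the worst features carry large distinct ranks.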
Parameter settings of ML models.
| Model | Setting |
|---|---|
| Random Forest (RF) | Number of estimators = 100 |
| | No limitation of depth |
| | Minimum samples for splitting = 2 |
| Support Vector Machine (SVM) | Kernel = Linear |
| | C = 1.0 |
| Extreme Gradient Boosting (XGB) | Number of estimators = 100 |
| | Learning rate = 0.3 |
| Logistic Regression (LR) | Penalty = |
| | C = 1e5 |
| | Class weight = None |
| | Multi_class = auto |
| Artificial Neural Network (ANN) | # of hidden layers = 2 |
| | # of nodes of each hidden layer = 100 |
| | Activation function = ReLU |
| | Optimizer = Adam |
| | learning_rate = 0.0001 |
| | # of epochs = 50 with early stopping |
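The settings above map directly onto scikit-learn constructor arguments. A sketch (not the authors' code) instantiating four of the five models with those hyperparameters; XGB (n_estimators=100, learning_rate=0.3) would come from the separate xgboost package and is omitted here to stay scikit-learn-only:

```python
# Instantiating the models with the listed hyperparameters. Parameters not
# given in the table (e.g. LR's penalty value, max_iter) are left at
# reasonable defaults and are assumptions, not the paper's values.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

models = {
    "RF": RandomForestClassifier(n_estimators=100, max_depth=None,
                                 min_samples_split=2),
    "SVM": SVC(kernel="linear", C=1.0),
    "LR": LogisticRegression(C=1e5, class_weight=None, max_iter=1000),
    "ANN": MLPClassifier(hidden_layer_sizes=(100, 100), activation="relu",
                         solver="adam", learning_rate_init=0.0001,
                         max_iter=50, early_stopping=True),
}
for name, model in models.items():
    print(name, type(model).__name__)
```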
Averaged test set accuracy of ML models, for all features and for several numbers of features selected using the RFE algorithm.
| Model | ||||||
|---|---|---|---|---|---|---|
| RF | 0.8294 | 0.8280 | 0.8256 | 0.8250 | 0.8284 | 0.8258 |
| SVM | 0.7850 | 0.7920 | 0.7964 | 0.8126 | 0.7931 | 0.7695 |
| XGB | 0.8318 | 0.8283 | 0.8342 | 0.8287 | 0.8230 | 0.8177 |
| LR | 0.7795 | 0.7828 | 0.7952 | 0.8111 | 0.7910 | 0.7682 |
| ANN | 0.8223 | 0.8269 | 0.8210 | 0.8283 | 0.8185 | 0.8100 |
Per-label averaged test set precision of ML models, for all features and for several numbers of features selected using the RFE algorithm; ‘Active’ and ‘Inactive’ mean label 1 and 0, respectively.
| Model | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Inactive | Active | Inactive | Active | Inactive | Active | Inactive | Active | Inactive | Active | |
| RF | 0.8053 | 0.8462 | 0.8400 | 0.8712 | 0.7703 | 0.8467 | 0.8015 | 0.8583 | 0.8456 | 0.8457 |
| SVM | 0.7651 | 0.7121 | 0.7925 | 0.7102 | 0.7986 | 0.7801 | 0.8090 | 0.7958 | 0.8063 | 0.7795 |
| XGB | 0.8262 | 0.8020 | 0.8403 | 0.8429 | 0.8259 | 0.8387 | 0.8582 | 0.8477 | 0.8223 | 0.8125 |
| LR | 0.7973 | 0.8033 | 0.8148 | 0.7512 | 0.7958 | 0.7641 | 0.7643 | 0.7085 | 0.7819 | 0.7845 |
| ANN | 0.8381 | 0.8019 | 0.8405 | 0.8090 | 0.8316 | 0.8071 | 0.8433 | 0.8094 | 0.8345 | 0.7970 |
Per-label averaged test set recall of ML models, for all features and for several numbers of features selected using the RFE algorithm; ‘Active’ and ‘Inactive’ mean label 1 and 0, respectively.
| Model | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Inactive | Active | Inactive | Active | Inactive | Active | Inactive | Active | Inactive | Active | |
| RF | 0.9066 | 0.7008 | 0.9131 | 0.7713 | 0.9114 | 0.6427 | 0.9115 | 0.7032 | 0.8908 | 0.7862 |
| SVM | 0.7904 | 0.6812 | 0.7721 | 0.7343 | 0.8456 | 0.7198 | 0.8566 | 0.7343 | 0.8419 | 0.7343 |
| XGB | 0.8566 | 0.7633 | 0.8897 | 0.7778 | 0.8897 | 0.7536 | 0.8897 | 0.8068 | 0.8676 | 0.7536 |
| LR | 0.8676 | 0.7101 | 0.8088 | 0.7585 | 0.8309 | 0.7198 | 0.7868 | 0.6812 | 0.8566 | 0.6860 |
| ANN | 0.8521 | 0.7837 | 0.8589 | 0.7841 | 0.8598 | 0.7696 | 0.8578 | 0.7891 | 0.8494 | 0.7775 |
Per-label averaged test set F1 score of ML models, for all features and for several numbers of features selected using the RFE algorithm; ‘Active’ and ‘Inactive’ mean label 1 and 0, respectively.
| Model | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Inactive | Active | Inactive | Active | Inactive | Active | Inactive | Active | Inactive | Active | |
| RF | 0.8529 | 0.7666 | 0.8750 | 0.8182 | 0.8349 | 0.7306 | 0.8529 | 0.7730 | 0.8676 | 0.8148 |
| SVM | 0.7776 | 0.6963 | 0.7821 | 0.7221 | 0.8214 | 0.7487 | 0.8321 | 0.7638 | 0.8237 | 0.7562 |
| XGB | 0.8412 | 0.7822 | 0.8643 | 0.8090 | 0.8566 | 0.7938 | 0.8736 | 0.8267 | 0.8444 | 0.7820 |
| LR | 0.8310 | 0.7538 | 0.8118 | 0.7548 | 0.8129 | 0.7413 | 0.7754 | 0.6946 | 0.8175 | 0.7320 |
| ANN | 0.8445 | 0.7918 | 0.8493 | 0.7959 | 0.8452 | 0.7874 | 0.8501 | 0.7984 | 0.8417 | 0.7868 |
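The per-label precision, recall, and F1 values in the tables above are standard per-class metrics. A sketch (not the authors' code) of how they are computed with scikit-learn on a toy prediction vector:

```python
# Per-label precision / recall / F1, where label 0 = 'Inactive' and
# label 1 = 'Active', as in the paper's tables. The toy vectors below are
# illustrative only.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# labels=[0, 1] fixes the output order: index 0 = Inactive, index 1 = Active.
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1])
print(prec, rec, f1, support)
```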
Top 50 best and top 50 worst features selected by the RFE algorithm when the number of selected features is 361.
| Top50 Best Features | | | Top50 Worst Features | | |
|---|---|---|---|---|---|
| Module | Descriptor | Rank | Module | Descriptor | Rank |
| Acidic group count | nAcid | 1 | | nHmisc | 1030 |
| Atom count | nN | 1 | | nsLi | 1032 |
| | nO | 1 | | nssBe | 1034 |
| | nP | 1 | | nssssBem | 1036 |
| Autocorrelation | ATS2m | 1 | Atom type electrotopological state | nsBH2 | 1038 |
| | ATS4m | 1 | | nssBH | 1040 |
| | ATS3v | 1 | | nsssB | 1042 |
| | ATS4v | 1 | | nssssBm | 1044 |
| | ATS3e | 1 | | nssNH2p | 1076 |
| | ATS4e | 1 | | nssAsH | 1065 |
| | ATS7e | 1 | | nsssAs | 1066 |
| | ATS8e | 1 | | SssBH | 1047 |
| | ATS3p | 1 | | SddsN | 1074 |
| | ATS4p | 1 | | SssAsH | 1070 |
| | ATS3i | 1 | | SsssAs | 1057 |
| | ATS7i | 1 | | SdsssAs | 1058 |
| | ATS8i | 1 | | SsssssAs | 1068 |
| | ATS3s | 1 | | SdSe | 1043 |
| | ATS5s | 1 | | SssSe | 1060 |
| | AATS6v | 1 | | SaaSe | 1059 |
| | AATS8e | 1 | | SssSnH2 | 1035 |
| | AATS6p | 1 | | SsssSnH | 1053 |
| | AATS4i | 1 | | SssPbH2 | 1083 |
| | AATS6i | 1 | | SsssPbH | 1084 |
| | AATS1s | 1 | | minsBH2 | 1073 |
| | AATS2s | 1 | | minssBH | 1069 |
| | AATS5s | 1 | | minssSiH2 | 1077 |
| | AATS7s | 1 | | minsssSiH | 1075 |
| | AATS8s | 1 | | minssssSi | 1080 |
| | ATSC7c | 1 | | minsPH2 | 1082 |
| | ATSC8c | 1 | | minssPH | 1081 |
| | ATSC3v | 1 | | minddsP | 1072 |
| | ATSC4v | 1 | | minsssssP | 1071 |
| | ATSC6v | 1 | | minsGeH3 | 1051 |
| | ATSC7v | 1 | | minssGeH2 | 1052 |
| | ATSC1e | 1 | | minsssAs | 1050 |
| | ATSC2e | 1 | | mindsssAs | 1049 |
| | ATSC3e | 1 | | minddsAs | 1048 |
| | ATSC4e | 1 | | minssSe | 1046 |
| | ATSC5e | 1 | | minaaSe | 1056 |
| | ATSC6e | 1 | | mindssSe | 1055 |
| | ATSC0p | 1 | | minssssssSe | 1054 |
| | ATSC5p | 1 | | minddssSe | 1045 |
| | ATSC6p | 1 | | minsSnH3 | 1041 |
| | ATSC8p | 1 | | minssSnH2 | 1033 |
| | ATSC1i | 1 | | minsssSnH | 1031 |
| | ATSC4i | 1 | | minsPbH3 | 1067 |
| | ATSC7i | 1 | | maxsBH2 | 1078 |
| | ATSC8i | 1 | | maxddsN | 1037 |
| | ATSC6s | 1 | | maxaaS | 1079 |
Figure 3: All ML models’ test set accuracies.
Summary of comparison with previous studies.
| | Samuel Egieyeh et al. | Danishuddin, G. et al. | Our Work |
|---|---|---|---|
| Total # of data | 1155 | 4750 | 4794 |
| Total # of features | 76 | 98 | 1444 |
| Feature generation tool | RDKit | PaDEL | PaDEL |
| Feature selection | Feature Elimination | RFE | RFE, Kbest |
| Best model | SVM | SVM & XGBoost | ANN & XGB |
| Best accuracy (%) | 85.93 | ∼85.00 | ∼83.00 |