Bin Yang, Wenzheng Bao, Baitong Chen.
Abstract
To accurately screen the disease-related compounds of a traditional Chinese medicine prescription in network pharmacology research, a new virtual screening method is proposed, based on a flexible neural tree (FNT) model, a hybrid evolutionary method, and a negative sample selection algorithm. A novel hybrid evolutionary algorithm combining grammar-guided genetic programming and the salp swarm algorithm is used to infer the optimal FNT. Compounds related to hypertension, diabetes, and coronavirus disease 2019 (COVID-19) are collected from the recent literature, and the unrelated compounds are chosen by the negative sample selection algorithm. Four molecular descriptors (ECFP6, MACCS, Macrocycle, and RDKit) are each used to numerically characterize the chemical structure of every collected compound. The experimental results show that the proposed method performs better than classical classifiers [support vector machine (SVM), random forest (RF), AdaBoost, decision tree (DT), gradient boosting decision tree (GBDT), k-nearest neighbors (KNN), logistic regression (LR), and naive Bayes (NB)], a recent classifier (gcForest), and a deep learning method (forgeNet) in terms of AUC, ROC, TPR, FPR, precision, specificity, and F1. The MACCS descriptor suits the largest number of classifiers. All methods perform poorly with the ECFP6 molecular descriptor.
Keywords: flexible neural tree; grammar-guided genetic programming; network pharmacology; salp swarm algorithm; virtual screening
Year: 2022 PMID: 35733966 PMCID: PMC9207514 DOI: 10.3389/fmicb.2022.912145
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 6.064
FIGURE 1. An example of flexible neural tree.
FIGURE 2. A flexible neuron operator.
FIGURE 3. The flowchart of screening disease-related compounds algorithm.
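The flexible neuron operator named in the Figure 2 caption can be illustrated with a small sketch. It assumes the form commonly used in the flexible neural tree literature, where a neuron computes a weighted sum of its inputs and passes it through a Gaussian-style activation with two adjustable parameters. This record does not spell out the operator, so the formula, the function names, and every parameter value below are assumptions for illustration only.

```python
import math


def flexible_activation(a, b, x):
    """Assumed flexible activation: f(a, b, x) = exp(-((x - a) / b) ** 2)."""
    return math.exp(-((x - a) / b) ** 2)


def flexible_neuron(inputs, weights, a, b):
    """Weighted sum of the child outputs, passed through the flexible activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(w * x for w, x in zip(weights, inputs))
    return flexible_activation(a, b, net)


# Hypothetical two-level tree: root(x1, hidden(x2, x3)).
# Inputs, weights, a and b are made-up values, not taken from the paper.
x1, x2, x3 = 0.2, 0.5, 0.9
hidden = flexible_neuron([x2, x3], [0.4, 0.6], a=0.5, b=1.0)
output = flexible_neuron([x1, hidden], [0.7, 0.3], a=0.1, b=0.8)
print(hidden, output)
```

In such a tree, the structure (which inputs feed which neuron) and the parameters (weights, a, b) are both free, which is why the abstract pairs a structure search (grammar-guided genetic programming) with a parameter search (salp swarm algorithm).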
Negative sample selection algorithm.
| the generated decoy set [ |
| for |
| |
| for |
| |
| |
| End |
| End |
| Sort the decoy set according to [ |
| Select the decoys with 2 |
FIGURE 4. AUC performances of 11 methods with hypertension dataset.
FIGURE 5. AUC performances of 11 methods with diabetes dataset.
FIGURE 6. AUC performances of 11 methods with COVID-19 dataset.
Prediction performance of the 11 methods on the hypertension dataset.
| Molecular descriptors | Methods | TPR | FPR | Precision | Specificity | F1 |
| ECFP6 | Our method | | 0.022222 | 0.956522 | 0.977778 | |
| ECFP6 | gcForest | 0.955224 | 0.155556 | 0.752941 | 0.844444 | 0.842105 |
| ECFP6 | forgeNet | 0.895522 | | | | 0.944882 |
| ECFP6 | SVM | 0.880597 | 0.007407 | 0.983333 | 0.992593 | 0.929134 |
| ECFP6 | RF | 0.880597 | | | | 0.936508 |
| ECFP6 | AdaBoost | 0.835821 | 0.037037 | 0.918033 | 0.962963 | 0.875 |
| ECFP6 | DT | 0.835821 | 0.044444 | 0.903226 | 0.955556 | 0.868217 |
| ECFP6 | GBDT | 0.850746 | 0.051852 | 0.890625 | 0.948148 | 0.870229 |
| ECFP6 | KNN | 0.686567 | | | | 0.814159 |
| ECFP6 | LR | 0.970149 | 0.311111 | 0.607477 | 0.688889 | 0.747126 |
| ECFP6 | NB | 0.731343 | 0.096296 | 0.790323 | 0.903704 | 0.75969 |
| MACCS | Our method | | | | | |
| MACCS | gcForest | 0.970149 | 0.051852 | 0.902778 | 0.948148 | 0.935252 |
| MACCS | forgeNet | 0.925373 | 0.018587 | 0.96124 | 0.981413 | 0.942966 |
| MACCS | SVM | 0.940299 | 0.02963 | 0.940299 | 0.97037 | 0.940299 |
| MACCS | RF | 0.940299 | 0.014815 | 0.969231 | 0.985185 | 0.954545 |
| MACCS | AdaBoost | 0.895522 | 0.044444 | 0.909091 | 0.955556 | 0.902256 |
| MACCS | DT | 0.895522 | 0.051852 | 0.895522 | 0.948148 | 0.895522 |
| MACCS | GBDT | 0.925373 | 0.014815 | 0.96875 | 0.985185 | 0.946565 |
| MACCS | KNN | 0.925373 | 0.02963 | 0.939394 | 0.97037 | 0.932331 |
| MACCS | LR | 0.970149 | 0.066667 | 0.878378 | 0.933333 | 0.921986 |
| MACCS | NB | 0.940299 | 0.192593 | 0.707865 | 0.807407 | 0.807692 |
| Macrocycle | Our method | | | | | |
| Macrocycle | gcForest | 0.9375 | 0.09009 | 0.857143 | 0.90991 | 0.895522 |
| Macrocycle | forgeNet | 0.921875 | 0.018018 | 0.967213 | 0.981982 | 0.944 |
| Macrocycle | SVM | 0.890625 | 0.027027 | 0.95 | 0.972973 | 0.919355 |
| Macrocycle | RF | 0.90625 | 0.027027 | 0.95082 | 0.972973 | 0.928 |
| Macrocycle | AdaBoost | 0.953125 | 0.027027 | 0.953125 | 0.972973 | 0.953125 |
| Macrocycle | DT | 0.921875 | 0.072072 | 0.880597 | 0.927928 | 0.900763 |
| Macrocycle | GBDT | 0.90625 | 0.036036 | 0.935484 | 0.963964 | 0.920635 |
| Macrocycle | KNN | 0.921875 | 0.072072 | 0.880597 | 0.927928 | 0.900763 |
| Macrocycle | LR | 0.9375 | 0.153153 | 0.779221 | 0.846847 | 0.851064 |
| Macrocycle | NB | 0.9375 | 0.09009 | 0.857143 | 0.90991 | 0.895522 |
| RDKit | Our method | | | | | |
| RDKit | gcForest | 0.955224 | 0.02963 | 0.941176 | 0.97037 | 0.948148 |
| RDKit | forgeNet | 0.895522 | 0.022222 | 0.952381 | 0.977778 | 0.923077 |
| RDKit | SVM | 0.940299 | 0.014815 | 0.969231 | 0.985185 | 0.954545 |
| RDKit | RF | 0.865672 | 0.014815 | 0.966667 | 0.985185 | 0.913386 |
| RDKit | AdaBoost | 0.925373 | 0.014815 | 0.96875 | 0.985185 | 0.946565 |
| RDKit | DT | 0.873134 | 0.055762 | 0.886364 | 0.944238 | 0.879699 |
| RDKit | GBDT | 0.895522 | 0.02963 | 0.9375 | 0.97037 | 0.916031 |
| RDKit | KNN | 0.865672 | 0.044444 | 0.90625 | 0.955556 | 0.885496 |
| RDKit | LR | 0.955224 | 0.02963 | 0.941176 | 0.97037 | 0.948148 |
| RDKit | NB | 0.895522 | 0.214815 | 0.674157 | 0.785185 | 0.769231 |
Bold values denote the best performances.
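The five columns of the table above are standard confusion-matrix ratios, and a short sketch may help when reading them. The function name and the four counts used below are assumptions: the source does not report confusion-matrix counts, and the values 59, 8, 1, and 134 were chosen only because they reproduce the SVM row for ECFP6 on the hypertension dataset to six decimals.

```python
def screening_metrics(tp, fn, fp, tn):
    """Compute the table's five metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one positive,
    one negative, and one predicted positive).
    """
    return {
        "TPR": tp / (tp + fn),            # recall / sensitivity
        "FPR": fp / (fp + tn),
        "Precision": tp / (tp + fp),
        "Specificity": tn / (tn + fp),    # equals 1 - FPR
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts (not reported in the source) that match the
# SVM / ECFP6 / hypertension row after rounding to six decimals.
metrics = screening_metrics(tp=59, fn=8, fp=1, tn=134)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Because specificity is always 1 - FPR, the FPR and Specificity columns carry the same information, and F1 is fixed once TPR and precision are known.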
Prediction performance of the 11 methods on the COVID-19 dataset.
| Molecular descriptors | Methods | TPR | FPR | Precision | Specificity | F1 |
| ECFP6 | Our method | | | | | |
| ECFP6 | gcForest | | 0.101695 | 0.825243 | 0.898305 | 0.890052 |
| ECFP6 | forgeNet | 0.931818 | 0.00565 | 0.987952 | 0.99435 | 0.959064 |
| ECFP6 | SVM | 0.920455 | 0.011299 | 0.975904 | 0.988701 | 0.947368 |
| ECFP6 | RF | 0.931818 | | | | 0.964706 |
| ECFP6 | AdaBoost | 0.896226 | 0.025882 | 0.945274 | 0.974118 | 0.920097 |
| ECFP6 | DT | 0.909091 | 0.045198 | 0.909091 | 0.954802 | 0.909091 |
| ECFP6 | GBDT | 0.886364 | 0.028249 | 0.939759 | 0.971751 | 0.912281 |
| ECFP6 | KNN | 0.897727 | 0.435028 | 0.50641 | 0.564972 | 0.647541 |
| ECFP6 | LR | 0.988636 | 0.214689 | 0.696 | 0.785311 | 0.816901 |
| ECFP6 | NB | 0.636364 | 0.062147 | 0.835821 | 0.937853 | 0.722581 |
| MACCS | Our method | | | | | |
| MACCS | gcForest | 0.954545 | 0.011299 | 0.976744 | 0.988701 | 0.965517 |
| MACCS | forgeNet | 0.943182 | 0.008499 | 0.982249 | 0.991501 | 0.962319 |
| MACCS | SVM | 0.931818 | 0.011299 | 0.97619 | 0.988701 | 0.953488 |
| MACCS | RF | 0.954545 | | | | 0.976744 |
| MACCS | AdaBoost | 0.886364 | 0.016949 | 0.962963 | 0.983051 | 0.923077 |
| MACCS | DT | 0.931818 | 0.033898 | 0.931818 | 0.966102 | 0.931818 |
| MACCS | GBDT | 0.931818 | 0.00565 | 0.987952 | 0.99435 | 0.959064 |
| MACCS | KNN | 0.954545 | 0.028249 | 0.94382 | 0.971751 | 0.949153 |
| MACCS | LR | 0.954545 | 0.016949 | 0.965517 | 0.983051 | 0.96 |
| MACCS | NB | 0.863636 | 0.090395 | 0.826087 | 0.909605 | 0.844444 |
| Macrocycle | Our method | | | | | |
| Macrocycle | gcForest | 0.954023 | 0.006536 | 0.988095 | 0.993464 | 0.97076 |
| Macrocycle | forgeNet | 0.954023 | | | | 0.976471 |
| Macrocycle | SVM | 0.942529 | 0.006536 | 0.987952 | 0.993464 | 0.964706 |
| Macrocycle | RF | 0.942529 | 0.006536 | 0.987952 | 0.993464 | 0.964706 |
| Macrocycle | AdaBoost | 0.954023 | | | | 0.976471 |
| Macrocycle | DT | 0.908046 | 0.039216 | 0.929412 | 0.960784 | 0.918605 |
| Macrocycle | GBDT | 0.896552 | 0.03268 | 0.939759 | 0.96732 | 0.917647 |
| Macrocycle | KNN | 0.931034 | 0.019608 | 0.964286 | 0.980392 | 0.947368 |
| Macrocycle | LR | 0.954023 | 0.026144 | 0.954023 | 0.973856 | 0.954023 |
| Macrocycle | NB | 0.885057 | 0.039216 | 0.927711 | 0.960784 | 0.905882 |
| RDKit | Our method | | | | | |
| RDKit | gcForest | 0.943182 | 0.022599 | 0.954023 | 0.977401 | 0.948571 |
| RDKit | forgeNet | 0.943182 | 0.011299 | 0.976471 | 0.988701 | 0.959538 |
| RDKit | SVM | 0.943182 | 0.011299 | 0.976471 | 0.988701 | 0.959538 |
| RDKit | RF | 0.931818 | 0.00565 | 0.987952 | 0.99435 | 0.959064 |
| RDKit | AdaBoost | 0.931818 | 0.016949 | 0.964706 | 0.983051 | 0.947977 |
| RDKit | DT | 0.943182 | 0.011299 | 0.976471 | 0.988701 | 0.959538 |
| RDKit | GBDT | 0.943182 | 0.011299 | 0.976471 | 0.988701 | 0.959538 |
| RDKit | KNN | 0.954545 | 0.016949 | 0.965517 | 0.983051 | 0.96 |
| RDKit | LR | 0.943182 | 0.028249 | 0.943182 | 0.971751 | 0.943182 |
| RDKit | NB | 0.897727 | 0.112994 | 0.79798 | 0.887006 | 0.84492 |
Bold values denote the best performances.
Prediction performance of the 11 methods on the diabetes dataset.
| Molecular descriptors | Methods | TPR | FPR | Precision | Specificity | F1 |
| ECFP6 | Our method | 0.991935 | 0.012048 | 0.97619 | 0.987952 | |
| ECFP6 | gcForest | 0.967742 | 0.124498 | 0.794702 | 0.875502 | 0.872727 |
| ECFP6 | forgeNet | 0.916031 | | | | 0.948617 |
| ECFP6 | SVM | 0.935484 | 0.02008 | 0.958678 | 0.97992 | 0.946939 |
| ECFP6 | RF | 0.862903 | 0.008032 | 0.981651 | 0.991968 | 0.918455 |
| ECFP6 | AdaBoost | 0.879032 | 0.036145 | 0.923729 | 0.963855 | 0.900826 |
| ECFP6 | DT | 0.806452 | 0.100402 | 0.8 | 0.899598 | 0.803213 |
| ECFP6 | GBDT | 0.854839 | 0.02008 | 0.954955 | 0.97992 | 0.902128 |
| ECFP6 | KNN | | 0.939759 | 0.346369 | 0.060241 | 0.514523 |
| ECFP6 | LR | 0.967742 | 0.15261 | 0.759494 | 0.84739 | 0.851064 |
| ECFP6 | NB | 0.604839 | 0.052209 | 0.852273 | 0.947791 | 0.707547 |
| MACCS | Our method | | | | | |
| MACCS | gcForest | | 0.02008 | 0.960317 | 0.97992 | 0.968 |
| MACCS | forgeNet | 0.951613 | 0.024096 | 0.951613 | 0.975904 | 0.951613 |
| MACCS | SVM | 0.935484 | 0.024096 | 0.95082 | 0.975904 | 0.943089 |
| MACCS | RF | 0.943548 | 0.012048 | 0.975 | 0.987952 | 0.959016 |
| MACCS | AdaBoost | 0.943548 | 0.032129 | 0.936 | 0.967871 | 0.939759 |
| MACCS | DT | 0.951613 | 0.040161 | 0.921875 | 0.959839 | 0.936508 |
| MACCS | GBDT | 0.975806 | 0.02008 | 0.960317 | 0.97992 | 0.968 |
| MACCS | KNN | 0.951613 | 0.044177 | 0.914729 | 0.955823 | 0.932806 |
| MACCS | LR | 0.975806 | 0.02008 | 0.960317 | 0.97992 | 0.968 |
| MACCS | NB | 0.967742 | 0.417671 | 0.535714 | 0.582329 | 0.689655 |
| Macrocycle | Our method | | | | | |
| Macrocycle | gcForest | 0.982906 | 0.028037 | 0.950413 | 0.971963 | 0.966387 |
| Macrocycle | forgeNet | 0.957265 | 0.009346 | 0.982456 | 0.990654 | 0.969697 |
| Macrocycle | SVM | 0.974359 | 0.018692 | 0.966102 | 0.981308 | 0.970213 |
| Macrocycle | RF | 0.957265 | 0.014019 | 0.973913 | 0.985981 | 0.965517 |
| Macrocycle | AdaBoost | 0.957265 | 0.018692 | 0.965517 | 0.981308 | 0.961373 |
| Macrocycle | DT | 0.91453 | 0.037383 | 0.930435 | 0.962617 | 0.922414 |
| Macrocycle | GBDT | 0.965812 | 0.046729 | 0.918699 | 0.953271 | 0.941667 |
| Macrocycle | KNN | 0.923077 | 0.018692 | 0.964286 | 0.981308 | 0.943231 |
| Macrocycle | LR | 0.982906 | 0.042056 | 0.927419 | 0.957944 | 0.954357 |
| Macrocycle | NB | 0.974359 | 0.042056 | 0.926829 | 0.957944 | 0.95 |
| RDKit | Our method | 0.959677 | | | | |
| RDKit | gcForest | 0.959677 | 0.02008 | 0.959677 | 0.97992 | 0.959677 |
| RDKit | forgeNet | | 0.012048 | 0.97561 | 0.987952 | 0.97166 |
| RDKit | SVM | 0.951613 | 0.008032 | 0.983333 | 0.991968 | 0.967213 |
| RDKit | RF | 0.935484 | 0.012048 | 0.97479 | 0.987952 | 0.954733 |
| RDKit | AdaBoost | 0.943548 | 0.016064 | 0.966942 | 0.983936 | 0.955102 |
| RDKit | DT | 0.943548 | 0.028112 | 0.943548 | 0.971888 | 0.943548 |
| RDKit | GBDT | 0.943548 | 0.008032 | 0.983193 | 0.991968 | 0.962963 |
| RDKit | KNN | 0.903226 | 0.012048 | 0.973913 | 0.987952 | 0.937238 |
| RDKit | LR | 0.959677 | 0.024096 | 0.952 | 0.975904 | 0.955823 |
| RDKit | NB | 0.951613 | 0.204819 | 0.698225 | 0.795181 | 0.805461 |
Average ranking scores of the 11 methods across the 3 datasets.
| Methods | ECFP6 | MACCS | Macrocycle | RDKit |
| Our method | 3.33 | | 2 | 2.67 |
| gcForest | 3.67 | | 2.33 | 2.17 |
| forgeNet | 2.5 | | 2.33 | 3 |
| SVM | 2.83 | 2.5 | 2.5 | |
| RF | 2.83 | | 2.5 | 3.17 |
| AdaBoost | 3.5 | 2.5 | | 2.17 |
| DT | 4 | 1.83 | 2.5 | |
| GBDT | 3.5 | | 2.83 | 2.33 |
| KNN | 4 | 1.83 | | 2.5 |
| LR | 3.83 | | 2.83 | 2.17 |
| NB | 3.67 | 2.83 | | 2.33 |
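One plausible reading of the averaged ranking scores is sketched below. The assumption is that, for each classifier, the four descriptors are ranked on each dataset (rank 1 is best, tied scores share the mean of their ranks) and the ranks are then averaged over the three datasets. The source does not state the ranking rule or the metric being ranked, so this procedure is a guess, and the AUC values in the example are hypothetical rather than taken from the paper. Table entries such as 1.83 and 2.17, which are multiples of 1/6, are at least consistent with tie-sharing ranks averaged over three datasets.

```python
def rank_descending(scores):
    """Rank scores so the highest gets rank 1; ties share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def average_ranks(per_dataset_scores):
    """Average the per-dataset ranks of each descriptor across datasets."""
    per_dataset_ranks = [rank_descending(s) for s in per_dataset_scores]
    n = len(per_dataset_ranks)
    return [sum(column) / n for column in zip(*per_dataset_ranks)]


descriptors = ["ECFP6", "MACCS", "Macrocycle", "RDKit"]
# Hypothetical AUC values for one classifier on three datasets (not from the source).
auc = [
    [0.91, 0.97, 0.95, 0.95],
    [0.90, 0.98, 0.96, 0.94],
    [0.93, 0.96, 0.97, 0.92],
]
avg = average_ranks(auc)
for name, score in zip(descriptors, avg):
    print(f"{name}: {score:.2f}")
```

Under this reading, a lower score means the descriptor suits that classifier better.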
FIGURE 7. Performance of our method on the COVID-19 dataset with different ratios of positive to negative samples.