Jiajia Liu1,2,3, Zhihui Zhou1,3,4, Shanshan Kong1,2,3,4,5, Zezhong Ma1,2,5.
Abstract
Optimizing drug properties during cancer drug development is important for saving research and development time and cost. To obtain anti-breast cancer drug candidates with good biological activity, this paper collected 1974 compounds. First, the 20 molecular descriptors with the greatest influence on biological activity were screened using XGBoost-based feature selection. Second, on this basis, with pIC50 values as the prediction target, several machine learning algorithms were compared in order to select the most suitable one for predicting IC50 and pIC50 values. Preliminary results show that the Random Forest, XGBoost, and gradient boosting algorithms perform well and differ little from one another, while the support vector machine performs worst. A semi-automatic parameter adjustment method was then used to search for the optimal parameters of the Random Forest, XGBoost, and gradient boosting algorithms. The Random Forest algorithm was found to be accurate, resistant to overfitting, and stable, with a prediction accuracy of 0.745. Finally, the accuracy of the results was verified by training the model with the preliminarily selected data. This provides an innovative solution for optimizing the properties of anti-breast cancer drugs and can better support their early research and development.
Keywords: anti-breast cancer; bioactivity; parameter optimization; random forest; xgboost
Year: 2022 PMID: 35936743 PMCID: PMC9353770 DOI: 10.3389/fonc.2022.956705
Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 5.738
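The abstract refers to both IC50 and pIC50 values. As a minimal sketch of how the two relate, the snippet below uses the standard definition pIC50 = -log10(IC50 in mol/L). The nanomolar input unit and the function names are assumptions made for illustration; the source does not state the units of its IC50 data.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar (assumed unit) to pIC50.

    pIC50 is the negative base-10 logarithm of IC50 expressed in mol/L.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


def ic50_nm_from_pic50(pic50: float) -> float:
    """Inverse conversion: pIC50 back to IC50 in nanomolar."""
    return 10 ** (9 - pic50)
```

A larger pIC50 corresponds to a smaller IC50, that is, a more potent compound, which is why the logarithmic form is the more convenient regression target.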
Figure 1. Molecular descriptor feature weights.
Figure 2. Top 20 ranking chart of feature weights.
Range of molecular descriptor values found by the optimization search model.
| Molecular descriptors | Weighting value | Molecular descriptors | Weighting value |
|---|---|---|---|
| C1SP2 | 0.121828 | VC-5 | 0.013187 |
| nC | 0.106756 | minHBint5 | 0.011962 |
| MDEC-23 | 0.089470 | TopoPSA | 0.010527 |
| LipoaffinityIndex | 0.065367 | MDEO-12 | 0.009671 |
| minHsOH | 0.036505 | C3SP2 | 0.009461 |
| nHBAcc | 0.036146 | SHBint8 | 0.009285 |
| SdsN | 0.029560 | maxHBint8 | 0.008846 |
| maxssO | 0.024947 | ndssC | 0.008720 |
| minssO | 0.024412 | MDEC-34 | 0.007981 |
| minsOH | 0.022663 | BCUTc-11 | 0.007961 |
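The ranking in the table above amounts to sorting descriptors by their weight and keeping the largest ones. The sketch below illustrates that selection step only. The function name `top_k_descriptors` is hypothetical, and the weights are a handful of values copied from the table rather than the output of a fitted model. In a full pipeline the weights would presumably come from a fitted XGBoost model (for example its `feature_importances_` attribute), which is an assumption about the workflow, not something the source spells out.

```python
from typing import Dict, List, Tuple


def top_k_descriptors(weights: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k descriptors with the largest weights, highest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Five weights copied from the table above, deliberately out of order.
example_weights = {
    "nHBAcc": 0.036146,
    "C1SP2": 0.121828,
    "TopoPSA": 0.010527,
    "nC": 0.106756,
    "MDEC-23": 0.089470,
}

top3 = top_k_descriptors(example_weights, k=3)
for name, weight in top3:
    print(f"{name}: {weight:.6f}")
```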
Figure 3. Algorithm model training process. (A) GBR (B) XGBoost regression (C) SVR (D) Random Forest regression.
Figure 4. Semi-automatic parameter adjustment process. (A) Optimization process of Random Forest (B) Optimization process of XGBoost (C) Optimization process of GBR.
Comparison of accuracy before and after parameter adjustment.
| Algorithm | Random Forest regression | XGBoost regression | GBR |
|---|---|---|---|
| Accuracy before parameter adjustment | 0.707829 | 0.678772 | 0.676599 |
| Accuracy after parameter adjustment | 0.745329 | 0.724967 | 0.730603 |
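The source does not define its semi-automatic parameter adjustment method. One plausible reading, assumed here, is a one-at-a-time sweep: each hyperparameter is varied over a candidate list while the others stay fixed, and the best value is kept before moving to the next. The sketch below shows that pattern. The function names, the hyperparameter names, and `toy_score` are all hypothetical; `toy_score` is a stand-in for whatever validation score a real model would return, not a result from the paper.

```python
from typing import Any, Callable, Dict, List, Tuple

Params = Dict[str, Any]


def sweep_one_at_a_time(
    score: Callable[[Params], float],
    start: Params,
    grid: Dict[str, List[Any]],
) -> Tuple[Params, float]:
    """Tune one hyperparameter at a time, keeping the best value found so far."""
    best_params = dict(start)
    best_score = score(best_params)
    for name, candidates in grid.items():
        for value in candidates:
            trial = {**best_params, name: value}
            trial_score = score(trial)
            if trial_score > best_score:
                best_params, best_score = trial, trial_score
    return best_params, best_score


def toy_score(params: Params) -> float:
    """Toy stand-in for a validation score; it peaks at (200, 8) by construction."""
    return (
        1.0
        - abs(params["n_estimators"] - 200) / 400
        - abs(params["max_depth"] - 8) / 16
    )


best_params, best_score = sweep_one_at_a_time(
    toy_score,
    start={"n_estimators": 100, "max_depth": 4},
    grid={"n_estimators": [50, 100, 200, 400], "max_depth": [4, 8, 16]},
)
```

A sweep of this kind evaluates far fewer combinations than an exhaustive grid search, at the cost of possibly missing interactions between hyperparameters.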
Model evaluation error analysis.
| Evaluation Metrics | MSE | MAE | EV | R² |
|---|---|---|---|---|
| Random Forest | 0.077646 | 0.203693 | 0.961473 | 0.961529 |
| GBR | 0.135665 | 0.268423 | 0.932684 | 0.932684 |
| XGBoost | 0.152522 | 0.290378 | 0.924355 | 0.924319 |
Model test error analysis.
| Evaluation Metrics | MAE | RMSE | MAPE | EV |
|---|---|---|---|---|
| Random Forest | 0.544467 | 0.135665 | 0.089044 | 0.715558 |
| GBR | 0.557161 | 0.135665 | 0.090808 | 0.701300 |
| XGBoost | 0.567658 | 0.135665 | 0.091912 | 0.687493 |
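The error tables report MSE, RMSE, MAE, MAPE, explained variance (EV), and R² without defining them. The sketch below computes these under their standard definitions, which is an assumption about what the authors used. The function name and the four-point example are hypothetical and are not data from the paper.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute common regression metrics from paired true and predicted values.

    Assumes y_true is non-constant and contains no zeros (needed for R2/EV and MAPE).
    """
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mape = sum(abs(r) / abs(t) for r, t in zip(residuals, y_true)) / n

    y_mean = sum(y_true) / n
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    r_mean = sum(residuals) / n
    ss_res_centered = sum((r - r_mean) ** 2 for r in residuals)

    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "MAPE": mape,
        # EV ignores a constant bias in the residuals; R2 penalizes it.
        "EV": 1.0 - ss_res_centered / ss_tot,
        "R2": 1.0 - (mse * n) / ss_tot,
    }


# Hypothetical example: every prediction is too high by exactly 0.5.
metrics = regression_metrics([5.0, 6.0, 7.0, 8.0], [5.5, 6.5, 7.5, 8.5])
```

EV and R² coincide when the residuals average to zero and diverge when predictions are systematically biased, which is one reason the two columns in a table can differ slightly.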
Figure 5. Algorithm model regression test results. (A) GBR (B) XGBoost regression (C) Random Forest regression.
Figure 6. Algorithm model test results.
Final prediction accuracy.
| Algorithm | Random Forest regression | GBR | XGBoost regression |
|---|---|---|---|
| Accuracy | 0.766850 | 0.758679 | 0.756774 |