Yue Gao, Songling Chen, Junyi Tong, Xiangling Fu.
Abstract
BACKGROUND: Breast cancer is currently one of the cancers with a high mortality rate worldwide. Biological research on anti-breast cancer drugs focuses on the activity of estrogen receptor alpha (ERα), the pharmacokinetic properties, and the safety of the compounds; this research, however, is expensive and time-consuming. Developments in deep learning offer the potential to make candidate drug selection against breast cancer more efficient.
Keywords: Bioinformatics; Breast cancer; Decision support system; Deep learning; Drug prediction; Feature engineering; Graph neural network; Molecular representation
Year: 2022 PMID: 36123643 PMCID: PMC9484163 DOI: 10.1186/s12859-022-04913-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 The pipeline of the whole candidate drug selection method
Fig. 2 Framework of the topological molecular graph representation for the ABCD-GGNN representation method
Descriptions of components of the feature initialization for the atomic nodes
| Atomic descriptor | Description | Vector size |
|---|---|---|
| Atom type | 12 types of atoms in the 200 molecules of the dataset | 12-digit 0/1 vector |
| Number of bonds | The number of chemical bonds that the atom participates in | 6-digit 0/1 vector |
| Formal charge | The integer electric charge assigned to the atom | 5-digit 0/1 vector |
| Chirality | CW, CCW, unspecified, or other | 4-digit 0/1 vector |
| Number of bonded hydrogens | The number of hydrogen atoms bonded to the atom | 5-digit 0/1 vector |
| Hybridization | sp, sp2, sp3, sp3d, or sp3d2 | 5-digit 0/1 vector |
| Aromaticity | Whether the atom is part of an aromatic ring system | 1-digit 0/1 vector |
| Atom mass | The mass of the atom | A normalized number between 0 and 1 |
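The descriptor table above can be read as a recipe for concatenating one-hot blocks into a single node feature vector. The following is a minimal sketch of that idea, not the authors' implementation: the category vocabularies (`ATOM_TYPES`, `BOND_COUNTS`, and so on), the dictionary keys, and the mass normalizer `MAX_MASS` are all assumptions chosen only so that the block sizes match the table.

```python
# Illustrative sketch only. Every vocabulary below is an assumption chosen
# to match the vector sizes in the table, not the paper's actual categories.
ATOM_TYPES = ["C", "N", "O", "S", "F", "Cl", "Br", "I", "P", "Si", "B", "H"]  # 12
BOND_COUNTS = [0, 1, 2, 3, 4, 5]                                              # 6
FORMAL_CHARGES = [-2, -1, 0, 1, 2]                                            # 5
CHIRALITY = ["CW", "CCW", "unspecified", "other"]                             # 4
H_COUNTS = [0, 1, 2, 3, 4]                                                    # 5
HYBRIDIZATION = ["sp", "sp2", "sp3", "sp3d", "sp3d2"]                         # 5
MAX_MASS = 126.904  # hypothetical normalizer (roughly the mass of iodine)


def one_hot(value, choices):
    """Return a 0/1 vector with a 1 at the position of `value` (all 0 if absent)."""
    return [1.0 if value == choice else 0.0 for choice in choices]


def atom_features(atom):
    """Concatenate the eight descriptor blocks into one feature vector."""
    return (
        one_hot(atom["symbol"], ATOM_TYPES)
        + one_hot(atom["num_bonds"], BOND_COUNTS)
        + one_hot(atom["formal_charge"], FORMAL_CHARGES)
        + one_hot(atom["chirality"], CHIRALITY)
        + one_hot(atom["num_hydrogens"], H_COUNTS)
        + one_hot(atom["hybridization"], HYBRIDIZATION)
        + [1.0 if atom["aromatic"] else 0.0]
        + [min(atom["mass"] / MAX_MASS, 1.0)]  # scaled into [0, 1]
    )


# A hypothetical aromatic carbon, written as a plain dictionary.
aromatic_carbon = {
    "symbol": "C",
    "num_bonds": 3,
    "formal_charge": 0,
    "chirality": "unspecified",
    "num_hydrogens": 1,
    "hybridization": "sp2",
    "aromatic": True,
    "mass": 12.011,
}
features = atom_features(aromatic_carbon)
```

With these block sizes the vector has 12 + 6 + 5 + 4 + 5 + 5 + 1 + 1 = 39 entries. In practice the raw atom properties would come from a cheminformatics toolkit rather than a hand-written dictionary.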
Performance comparison on the prediction of ERα activity
| Model | MSE | R2 |
|---|---|---|
| Linear Regression | 2.156 | 0.276 |
| Random Forest | 0.5147 | 0.6133 |
| SVM | 0.6878 | 0.6273 |
| ABCD-GGNN | 0.4811 | 0.7741 |
We run all models 10 times and report the mean test MSE and R2
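For reference, the two regression metrics in the table can be computed as follows. This is a generic sketch of the standard definitions of MSE and the coefficient of determination; the function names and the toy values are illustrative and are not taken from the paper.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_over_runs(values):
    """Average a metric over repeated runs, as done for the reported numbers."""
    return sum(values) / len(values)


# Toy activity values, made up for illustration.
y_true = [6.0, 7.0, 8.0, 9.0]
y_pred = [6.5, 7.0, 7.5, 9.0]
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```

A lower MSE and an R2 closer to 1 both indicate a better fit.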
Performance comparison on the prediction of ADMET
| Model | Dataset | Precision | Recall | F1 | AUC | AUPR |
|---|---|---|---|---|---|---|
| SVM | MN | 0.7843 | 0.6709 | 0.6943 | 0.7957 | 0.8209 |
| | HOB | 0.7733 | 0.7498 | 0.7607 | 0.8104 | 0.6239 |
| | hERG | 0.8080 | 0.7589 | 0.7791 | 0.8239 | 0.8494 |
| | CYP3A4 | 0.8397 | 0.7998 | 0.8133 | 0.8518 | 0.8591 |
| | Caco-2 | 0.8453 | 0.7807 | 0.8068 | 0.8552 | 0.7525 |
| BiLSTM | MN | 0.8226 | 0.7310 | 0.7537 | 0.8195 | 0.7731 |
| | HOB | 0.7462 | 0.7008 | 0.7165 | 0.7711 | 0.7337 |
| | hERG | 0.8350 | 0.7914 | 0.7968 | 0.8452 | 0.8196 |
| | CYP3A4 | 0.8838 | 0.8627 | 0.8741 | 0.9129 | 0.8952 |
| | Caco-2 | 0.8134 | 0.7954 | 0.8021 | 0.8533 | 0.8258 |
| Graph-CNN | MN | 0.8629 | 0.8293 | 0.8461 | 0.8710 | 0.8623 |
| | HOB | 0.8110 | 0.7635 | 0.7824 | 0.8369 | 0.8061 |
| | hERG | 0.8495 | 0.8690 | 0.8556 | 0.9081 | 0.8585 |
| | CYP3A4 | 0.8913 | 0.8827 | 0.8840 | 0.9304 | 0.8731 |
| | Caco-2 | 0.8479 | 0.8227 | 0.8306 | 0.8740 | 0.8881 |
| ABCD-GGNN | MN | 0.9255 | 0.9613 | 0.9430 | 0.9714 | 0.9862 |
| | HOB | 0.8637 | 0.8804 | 0.8712 | 0.9130 | 0.9273 |
| | hERG | 0.8914 | 0.8839 | 0.8842 | 0.9303 | 0.9456 |
| | CYP3A4 | 0.9474 | 0.9163 | 0.9355 | 0.9487 | 0.9322 |
| | Caco-2 | 0.8828 | 0.8832 | 0.8829 | 0.9296 | 0.9134 |
We run all models 10 times and report the mean test precision, recall, F1, AUC, and AUPR
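The first three classification metrics in the table follow from counts of true positives, false positives, and false negatives. The sketch below shows the standard binary definitions; it assumes labels encoded as 0/1 with 1 as the positive class, and the toy labels are made up for illustration rather than drawn from the ADMET datasets.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

AUC and AUPR are computed from predicted scores rather than hard labels, so they need the ranking of the model outputs and are not covered by this sketch.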
Runtime (s) on the ERα value prediction and ADMET property prediction tasks
| Method (ERα value prediction) | Runtime (s) | Method (ADMET property prediction) | Runtime (s) |
|---|---|---|---|
| Linear Regression | 0.0937 | SVM | 3.7634 |
| Random Forest | 3.9162 | BiLSTM | 19.0383 |
| SVM | 3.4928 | Graph-CNN | 62.8520 |
| ABCD-GGNN | 73.4433 | ABCD-GGNN | 76.1681 |
Ablation study to demonstrate the impact of discrete descriptor representation and topological graph representation for ABCD-GGNN on the ADMET prediction task
| Model | Dataset | Precision | Recall | F1 |
|---|---|---|---|---|
| Discrete molecular descriptor representation (w/o) | MN | 0.8942 | 0.8763 | 0.8823 |
| | HOB | 0.8392 | 0.8550 | 0.8439 |
| | hERG | 0.8547 | 0.8631 | 0.8561 |
| | CYP3A4 | 0.9274 | 0.9104 | 0.8967 |
| | Caco-2 | 0.8584 | 0.8722 | 0.8646 |
| Molecular graph representation (w/o) | MN | 0.7986 | 0.7316 | 0.7471 |
| | HOB | 0.8006 | 0.8348 | 0.8219 |
| | hERG | 0.7618 | 0.7092 | 0.7153 |
| | CYP3A4 | 0.8718 | 0.8026 | 0.8193 |
| | Caco-2 | 0.8162 | 0.8023 | 0.8025 |
| ABCD-GGNN | MN | 0.9255 | 0.9613 | 0.9430 |
| | HOB | 0.8637 | 0.8804 | 0.8712 |
| | hERG | 0.8914 | 0.8839 | 0.8842 |
| | CYP3A4 | 0.9474 | 0.9163 | 0.9355 |
| | Caco-2 | 0.8828 | 0.8832 | 0.8829 |
We run all models 10 times and report the mean test precision, recall, and F1
Fig. 3 Precision of ABCD-GGNN with a varying on ADMET prediction tasks
Ablation study on the pooling operation in the readout stage of ABCD-GGNN for ADMET prediction
| Pooling operation | MN | HOB | hERG | CYP3A4 | Caco-2 |
|---|---|---|---|---|---|
| Average pooling | 0.9173 | 0.8586 | 0.8840 | 0.9329 | 0.8751 |
| Max pooling | 0.9086 | 0.8514 | 0.8792 | 0.9245 | 0.8684 |
| Fusion | 0.9255 | 0.8637 | 0.8914 | 0.9474 | 0.8828 |
We run all models 10 times and report the mean test precision
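The three readout variants compared above can be illustrated on a small set of node embeddings. This record does not define how "Fusion" combines the two pooled vectors, so the sketch below assumes an element-wise sum of the average-pooled and max-pooled vectors; concatenation would be another plausible reading. The function names and the toy embeddings are hypothetical.

```python
def average_pooling(node_embeddings):
    """Mean of each embedding dimension over all nodes of the graph."""
    num_nodes = len(node_embeddings)
    return [sum(dim) / num_nodes for dim in zip(*node_embeddings)]


def max_pooling(node_embeddings):
    """Maximum of each embedding dimension over all nodes of the graph."""
    return [max(dim) for dim in zip(*node_embeddings)]


def fusion_pooling(node_embeddings):
    """ASSUMPTION: fuse by element-wise sum of the average and max readouts."""
    avg = average_pooling(node_embeddings)
    mx = max_pooling(node_embeddings)
    return [a + m for a, m in zip(avg, mx)]


# Two nodes with two-dimensional embeddings (toy values).
nodes = [[1.0, 4.0], [3.0, 2.0]]
avg_readout = average_pooling(nodes)
max_readout = max_pooling(nodes)
fused_readout = fusion_pooling(nodes)
```

Average pooling summarizes the whole molecule, while max pooling keeps the strongest per-dimension signal from any single atom; combining them lets the graph-level vector carry both kinds of information.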
Fig. 4 Precision of the molecular graph representation part of ABCD-GGNN with a varying interaction step on ADMET prediction tasks
Fig. 5 The score list and heatmap of the 50 molecular descriptors selected by XGBoost in the discrete molecular descriptor representation stage. (a) Score list, (b) heatmap
Fig. 6 Visualization of the clustering analysis on the results of the ranking operator. (a) Cluster heatmap, in which samples clustered together are more strongly correlated, (b) k-means clustering analysis
Fig. 7 Scores of the candidate drugs produced by the ranking operator