Jiansheng Wu, Ben Liu, Wallace K B Chan, Weijian Wu, Tao Pang, Haifeng Hu, Shancheng Yan, Xiaoyan Ke, Yang Zhang.
Abstract
MOTIVATION: Accurate prediction and interpretation of ligand bioactivities are essential for virtual screening and drug discovery. Unfortunately, many important drug targets lack experimental data on ligand bioactivities; this is particularly true for G protein-coupled receptors (GPCRs), which are the targets of about a third of the drugs currently on the market. Addressing this issue requires computational approaches that can precisely assess ligand bioactivities and identify the key substructural features that determine them.
Year: 2019; PMID: 31510691; PMCID: PMC6612825; DOI: 10.1093/bioinformatics/btz336
Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937
Fig. 1. Schematic of SED. The approach is composed of three stages: long extended-connectivity fingerprint (ECFP) representation for ligand molecules, feature selection by screening for Lasso, and construction of deep neural network regression prediction models.
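The second stage named in the caption, screening for Lasso, can be illustrated with a small sketch. The code below assumes the basic SAFE test for the problem 0.5·‖y − Xβ‖² + λ‖β‖₁, which discards feature j when |xⱼᵀy| < λ − ‖xⱼ‖‖y‖(λ_max − λ)/λ_max with λ_max = maxⱼ|xⱼᵀy|. Whether SED uses this exact rule is an assumption, and the function name `safe_screen` and the toy matrix are hypothetical.

```python
import math


def safe_screen(X, y, lam):
    """Return column indices that the basic SAFE test cannot discard.

    X   : list of samples, each a list of fingerprint bit values
    y   : list of bioactivity values, one per sample
    lam : Lasso penalty for 0.5*||y - X b||^2 + lam*||b||_1
    NOTE: assumed screening rule, not confirmed to be the one used by SED.
    """
    n_features = len(X[0])
    y_norm = math.sqrt(sum(v * v for v in y))

    # |x_j^T y| and ||x_j|| for every fingerprint bit j
    dots, norms = [], []
    for j in range(n_features):
        column = [row[j] for row in X]
        dots.append(abs(sum(c * v for c, v in zip(column, y))))
        norms.append(math.sqrt(sum(c * c for c in column)))

    lam_max = max(dots)
    if lam >= lam_max:
        # At or above lam_max the Lasso solution is all zeros.
        return []

    keep = []
    for j in range(n_features):
        threshold = lam - norms[j] * y_norm * (lam_max - lam) / lam_max
        if dots[j] >= threshold:
            keep.append(j)
    return keep


# Toy data (hypothetical): 4 ligands, 3 fingerprint bits.
X_toy = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0]]
y_toy = [3.0, 0.0, 4.0, 0.0]
print(safe_screen(X_toy, y_toy, lam=6.3))
```

A rule of this kind only removes features that are guaranteed to have zero Lasso coefficients at the chosen λ, so it shrinks a very long fingerprint before any model is fitted. Picking a fixed number of top features, as in the caption of the table below, would need an additional ranking step that this sketch does not cover.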
Performance of deep neural networks with top features selected from various sizes of long ECFPs
| Group | GPCR | EC | Baseline (1024) | Top 300 from 1024 | Top 300 from 5120 | Top 300 from 10 240 | Top 300 from 51 200 | Top 300 from 102 400 |
|---|---|---|---|---|---|---|---|---|
| I | P08908 | | 0.9268 | 0.9249 | 0.9310 | | 0.9227 | 0.9127 |
| | | RMSE(↓) | 1.0483 | 1.0878 | 0.9968 | | 1.0636 | 1.0982 |
| | Q9Y5N1 | | 0.9513 | 0.9464 | 0.9468 | | 0.9272 | 0.921 |
| | | RMSE(↓) | 1.0218 | 0.9627 | 0.9748 | | 1.0827 | 1.0889 |
| | P28335 | | 0.9096 | 0.9066 | 0.8989 | | 0.8983 | 0.8903 |
| | | RMSE(↓) | 1.1475 | 1.1335 | 1.1533 | | 1.1549 | 1.1723 |
| | P35372 | | 0.9034 | 0.8968 | 0.8966 | 0.8954 | 0.8796 | 0.8814 |
| | | RMSE(↓) | 1.2931 | 1.3478 | 1.1616 | | 1.2367 | 1.2384 |
| | Q99705 | | 0.9389 | 0.931 | 0.9393 | | 0.9295 | 0.9327 |
| | | RMSE(↓) | 1.1132 | 1.2236 | 0.9649 | | 0.9464 | 0.9351 |
| | P0DMS8 | | 0.8937 | 0.8859 | 0.8864 | | 0.8781 | 0.8555 |
| | | RMSE(↓) | 1.1979 | 1.2348 | 1.1987 | | 1.2572 | 1.3375 |
| | Q16602 | | 0.9268 | 0.9326 | 0.9514 | | 0.9516 | 0.9527 |
| | | RMSE(↓) | 1.2783 | 1.8135 | 1.6057 | 1.4746 | 1.4675 | 1.3730 |
| | P51677 | | 0.9329 | 0.9216 | 0.9338 | | 0.9211 | 0.9161 |
| | | RMSE(↓) | 1.0194 | 1.2781 | 1.0674 | 1.0280 | 1.0048 | 1.0989 |
| | P48039 | | 0.9180 | 0.9209 | 0.9108 | 0.9147 | 0.9126 | 0.908 |
| | | RMSE(↓) | 1.4047 | 1.4495 | 1.4607 | | 1.3699 | 1.3831 |
| II | Q9H228 | | 0.8152 | 0.8636 | 0.8789 | 0.8870 | | 0.8942 |
| | | RMSE(↓) | 1.6521 | 1.3965 | 1.5009 | 1.372 | | 1.3239 |
| | Q8TDU6 | | 0.8830 | 0.9124 | | 0.9206 | 0.9165 | 0.9077 |
| | | RMSE(↓) | 1.3289 | 1.1804 | | 1.0906 | 1.1056 | 1.1713 |
| | Q8TDS4 | | 0.9154 | 0.9262 | 0.929 | 0.9222 | | 0.9348 |
| | | RMSE(↓) | 1.0707 | 1.0445 | 1.1328 | 1.1051 | | 0.9906 |
| | Q9HC97 | | 0.6047 | 0.7097 | 0.7649 | | 0.8264 | 0.7801 |
| | | RMSE(↓) | 1.7889 | 1.5855 | 1.6228 | 1.3631 | 1.3282 | 1.4242 |
| | P41180 | | 0.7784 | 0.7916 | 0.8253 | | 0.8029 | 0.8217 |
| | | RMSE(↓) | 1.9226 | 1.7581 | 1.7082 | | 1.5869 | 1.5510 |
| | Q14833 | | 0.7429 | 0.7682 | | 0.7743 | 0.7424 | 0.7302 |
| | | RMSE(↓) | 1.6512 | 1.5453 | | 1.4754 | 1.6216 | 1.6719 |
| | Q99835 | | 0.8203 | 0.8790 | 0.892 | 0.8933 | 0.8999 | |
| | | RMSE(↓) | 1.5439 | 1.3669 | 1.1953 | 1.1924 | 1.155 | |
Group I: original number of ligands >600; II: original number of ligands ≤600.
Evaluation Criterion: ↑ (↓) indicates that larger (smaller) values are better; the best results for each evaluation criterion are highlighted in boldface.
Baseline: full-length ECFPs with 1024 bits.
Indicates that the performance of the method using the top 300 ECFP features selected from various ECFPs is significantly better than that of the baseline methods based on Wilcoxon signed-rank test.
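The significance note above refers to a Wilcoxon signed-rank test on paired results. A minimal sketch of that test follows, assuming paired RMSE scores from the baseline and from a model using selected features. This record does not state which runs were paired, so the scores are hypothetical. The two-sided p-value uses the normal approximation without continuity or tie correction, which is a simplification for small samples.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Return (W+, W-, two-sided p) for paired samples a and b.

    Zero differences are dropped; tied absolute differences share an
    average rank. The p-value is a normal approximation (assumption:
    no continuity or tie correction is applied).
    """
    diffs = [round(x - y, 12) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value


# Hypothetical paired RMSE values (not taken from the table).
baseline_rmse = [1.30, 1.25, 1.41, 1.18, 1.36]
selected_rmse = [1.21, 1.19, 1.30, 1.20, 1.22]
print(wilcoxon_signed_rank(baseline_rmse, selected_rmse))
```

For real analyses with few pairs, an exact-distribution implementation such as the one in SciPy is usually preferable to this approximation.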
Fig. 2. Effect of regression model on performance. GBDT, Gradient Boosting Decision Tree; SVR, Support Vector Regression; RF, Random Forest; DNN, deep neural network. (A): P08908; (B): Q9Y5N1; (C): P28335; (D): P35372; (E): Q99705; (F): P0DMS8; (G): Q16602; (H): P51677; (I): P48039; (J): Q9H228; (K): Q8TDU6; (L): Q8TDS4; (M): Q9HC97; (N): P41180; (O): Q14833; (P): Q99835
Fig. 3. Dependence of SED performance on the number of selected features. (A): P08908; (B): Q9Y5N1; (C): P28335; (D): P35372; (E): Q99705; (F): P0DMS8; (G): Q16602; (H): P51677; (I): P48039; (J): Q9H228; (K): Q8TDU6; (L): Q8TDS4; (M): Q9HC97; (N): P41180; (O): Q14833; (P): Q99835
Fig. 4. Pearson correlation analysis of selected features. T300, T100 and T50: the top 300, 100 and 50 features identified by screening for Lasso. R300: 300 features randomly selected from all dimensions of the ECFPs. (A): P08908; (B): Q9Y5N1; (C): P28335; (D): P35372; (E): Q99705; (F): P0DMS8; (G): Q16602; (H): P51677; (I): P48039; (J): Q9H228; (K): Q8TDU6; (L): Q8TDS4; (M): Q9HC97; (N): P41180; (O): Q14833; (P): Q99835
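The Fig. 4 caption compares selected feature sets (T300, T100, T50) with a random set (R300) by Pearson correlation. The caption alone does not say exactly what is correlated. The sketch below assumes each fingerprint bit is correlated with the bioactivity values and that a feature set is summarised by its mean absolute coefficient. The helper names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mean_abs_correlation(X, y, columns):
    """Mean |r| between the chosen feature columns of X and the activities y.

    Assumption: this summary statistic is illustrative and may differ
    from what Fig. 4 actually plots.
    """
    total = 0.0
    for j in columns:
        total += abs(pearson([row[j] for row in X], y))
    return total / len(columns)


# Toy data (hypothetical): 4 ligands, 2 fingerprint bits.
X_toy = [[0, 1], [1, 1], [0, 1], [1, 1]]
y_toy = [1.0, 2.0, 1.0, 2.0]
print(mean_abs_correlation(X_toy, y_toy, columns=[0, 1]))
```

Under this reading, a feature set chosen by screening would be expected to score higher than a random set of the same size, since random bits of a long ECFP are mostly unrelated to activity.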
Top 50 substructures identified by SED along with the associated Pearson correlation coefficients