Liang-Yong Xia, Yu-Wei Wang, De-Yu Meng, Xiao-Jun Yao, Hua Chai, Yong Liang.
Abstract
The quantitative structure-activity relationship (QSAR) model searches for a reliable relationship between chemical structure and biological activity in the field of drug design and discovery. (1) Background: In QSAR studies, the chemical structures of compounds are encoded by a substantial number of descriptors. Redundant, noisy, and irrelevant descriptors degrade the QSAR model, and too many descriptors can lead to overfitting or to a low correlation between chemical structure and biological activity.
Keywords: QSAR; biological activity; descriptor selection; log-sum; regularization
Year: 2017 PMID: 29271922 PMCID: PMC5795980 DOI: 10.3390/ijms19010030
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. The flow diagram shows the process of QSAR modeling: (1) collecting molecular structures and their activities; (2) calculating molecular descriptors, which can produce thousands of parameters for each molecular structure; (3) removing redundant or irrelevant descriptors via descriptor selection; (4) building the model with the optimum descriptor subset; (5) predicting the biological activity of a new molecular structure using the established model. Different color blocks represent different values.
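Steps (3) to (5) of this workflow can be illustrated with a minimal sketch. It assumes synthetic data in place of real molecular descriptors and uses an L1 (lasso) penalty fitted by coordinate descent as a stand-in selector. The names `soft_threshold` and `lasso_cd`, the penalty weight `lam`, and all data are hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_threshold(z, lam):
    """Univariate L1 thresholding: shrink z toward zero by lam."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]      # take descriptor j out of the fit
            z = X[:, j] @ resid / n
            beta[j] = soft_threshold(z, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]      # put the updated term back
    return beta

# Steps (1)-(2), simulated: 100 "compounds", 50 "descriptors", 3 of them relevant.
rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.1 * rng.standard_normal(n)

# Hold out the last 20 samples as a test set and center with training statistics.
X_tr, X_te, y_tr, y_te = X[:80], X[80:], y[:80], y[80:]
x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()

# Step (3): descriptor selection, keeping descriptors with nonzero coefficients.
beta = lasso_cd(X_tr - x_mean, y_tr - y_mean, lam=0.2)
selected = np.flatnonzero(beta)

# Step (4): refit an ordinary least-squares model on the selected subset.
coef, *_ = np.linalg.lstsq(X_tr[:, selected] - x_mean[selected],
                           y_tr - y_mean, rcond=None)

# Step (5): predict the activity of unseen samples and score the fit.
y_pred = (X_te[:, selected] - x_mean[selected]) @ coef + y_mean
r2 = 1 - ((y_te - y_pred) ** 2).sum() / ((y_te - y_te.mean()) ** 2).sum()
print(len(selected), round(float(r2), 3))
```

Swapping the L1 update for another univariate thresholding rule changes the selector while leaving the rest of the pipeline unchanged.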
Figure 2. Penalty functions: two are convex, whereas SCAD, MCP, and log-sum are non-convex. The log-sum penalty approximates $L_0$.
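The convex versus non-convex distinction in this caption can be checked numerically. The sketch below uses the standard textbook forms of the L1, SCAD, and MCP penalties and one common form of the log-sum penalty, `lam * log(1 + |b| / eps)`. These formulas and the parameter defaults (`a=3.7`, `gamma=3.0`, `eps=0.1`) are assumptions for illustration and are not copied from the paper; L1 serves as the convex reference.

```python
import numpy as np

def l1_penalty(b, lam):
    return lam * np.abs(b)

def scad_penalty(b, lam, a=3.7):
    b = np.abs(b)
    middle = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))
    flat = lam ** 2 * (a + 1) / 2
    return np.where(b <= lam, lam * b, np.where(b <= a * lam, middle, flat))

def mcp_penalty(b, lam, gamma=3.0):
    b = np.abs(b)
    return np.where(b <= gamma * lam,
                    lam * b - b ** 2 / (2 * gamma),
                    gamma * lam ** 2 / 2)

def log_sum_penalty(b, lam, eps=0.1):
    return lam * np.log(1 + np.abs(b) / eps)

def is_midpoint_convex(f, grid):
    """True if f((x+y)/2) <= (f(x)+f(y))/2 for every pair of grid points."""
    x, y = np.meshgrid(grid, grid)
    return bool(np.all(f((x + y) / 2) <= (f(x) + f(y)) / 2 + 1e-12))

grid = np.linspace(-3.0, 3.0, 61)
convexity = {
    "L1": is_midpoint_convex(lambda b: l1_penalty(b, 1.0), grid),
    "SCAD": is_midpoint_convex(lambda b: scad_penalty(b, 1.0), grid),
    "MCP": is_midpoint_convex(lambda b: mcp_penalty(b, 1.0), grid),
    "log-sum": is_midpoint_convex(lambda b: log_sum_penalty(b, 1.0), grid),
}
print(convexity)
```

A midpoint test on a finite grid can only disprove convexity, so a `True` result is evidence rather than proof; for L1 the property follows from the triangle inequality.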
Figure 3. Plot of thresholding functions for: (a) ; (b) ; (c) SCAD; (d) MCP; (e) ; and (f) log-sum.
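A thresholding function maps an unpenalized univariate estimate `z` to its penalized value. The sketch below gives the well-known closed forms for soft (L1), SCAD, and MCP thresholding under a standardized design. For log-sum it does not reproduce the paper's closed-form rule; instead it minimizes `0.5*(z - b)**2 + lam*log(1 + |b|/eps)` by grid search. The function names, that objective, and the parameter defaults are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """L1 rule: shrink by lam, set to zero inside [-lam, lam]."""
    return float(np.sign(z) * max(abs(z) - lam, 0.0))

def scad_threshold(z, lam, a=3.7):
    """SCAD rule (requires a > 2): soft near zero, unbiased beyond a*lam."""
    if abs(z) <= 2 * lam:
        return soft_threshold(z, lam)
    if abs(z) <= a * lam:
        return soft_threshold(z, a * lam / (a - 1)) / (1 - 1 / (a - 1))
    return float(z)

def mcp_threshold(z, lam, gamma=3.0):
    """MCP rule (requires gamma > 1): rescaled soft rule, unbiased beyond gamma*lam."""
    if abs(z) <= gamma * lam:
        return soft_threshold(z, lam) / (1 - 1 / gamma)
    return float(z)

def log_sum_threshold_numeric(z, lam, eps=0.5, n_grid=10001):
    """Grid-search minimizer of 0.5*(z - b)^2 + lam*log(1 + |b|/eps)."""
    b = np.linspace(0.0, abs(z), n_grid)   # the minimizer lies between 0 and |z|
    objective = 0.5 * (abs(z) - b) ** 2 + lam * np.log1p(b / eps)
    return float(np.sign(z) * b[np.argmin(objective)])

for z in (0.5, 3.0, 10.0):
    print(z,
          soft_threshold(z, 1.0),
          round(scad_threshold(z, 1.0), 4),
          round(mcp_threshold(z, 1.0), 4),
          round(log_sum_threshold_numeric(z, 1.0), 3))
```

Soft thresholding shrinks every surviving coefficient by `lam`, while the non-convex rules leave large coefficients nearly or exactly unshrunk, which is the qualitative difference such plots are meant to show.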
A brief description of four public datasets used in the experiments.
| Dataset Name | No. of Samples | No. of Descriptors | No. of Samples (Training) | No. of Samples (Test) |
|---|---|---|---|---|
| | 97 | 1083 | 78 | 19 |
| | 129 | 1089 | 104 | 25 |
| | 250 | 1120 | 200 | 50 |
| | 508 | 1562 | 407 | 101 |
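The split sizes in this table are consistent with assigning 80% of each dataset to training, rounded up, and the remainder to testing. This is an observation drawn from the numbers, not a procedure stated here; the helper `train_size` is a hypothetical name used only to check the arithmetic.

```python
def train_size(n_samples):
    """ceil(0.8 * n_samples), computed in integer arithmetic."""
    return (4 * n_samples + 4) // 5

splits = []
for n_samples in (97, 129, 250, 508):
    n_train = train_size(n_samples)
    splits.append((n_samples, n_train, n_samples - n_train))

for row in splits:
    print(row)
# (97, 78, 19)
# (129, 104, 25)
# (250, 200, 50)
# (508, 407, 101)
```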
The average number of variables selected in total by , , SCAD, MCP, and log-sum. In bold, the best performance is shown.
| Sample Size | SCAD | MCP | Log-Sum | ||||
|---|---|---|---|---|---|---|---|
| 381.60 | 92.92 | 19.09 | 23.36 | 19.13 | |||
| 498.81 | 34.18 | 19.03 | 19.09 | ||||
| 382.24 | 93.26 | 27.74 | 25.79 | 21.77 | |||
| 499.49 | 95.83 | 36.48 | 23.65 | 23.83 | |||
| 378.96 | 93.98 | 19.26 | 24.67 | 19.98 | |||
| 495.66 | 97.51 | 40.87 | 24.04 | 24.42 | |||
| 379.35 | 93.46 | 29.22 | 26.08 | 22.48 | |||
| 495.64 | 98.97 | 40.61 | 23.95 | 24.43 |
The average number of variables selected with a pre-set value (20) obtained by , , SCAD, MCP, and log-sum.
| Sample Size | SCAD | MCP | Log-Sum | ||||
|---|---|---|---|---|---|---|---|
| 12.23 | 14.45 | 19.09 | 18.81 | 19.13 | 19.00 | ||
| 16.22 | 20.00 | 19.03 | 19.00 | 19.09 | 19.00 | ||
| 12.24 | 14.30 | 19.93 | 19.42 | 19.74 | 19.81 | ||
| 16.26 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 | ||
| 11.84 | 13.57 | 18.88 | 18.40 | 18.65 | 18.88 | ||
| 15.79 | 19.99 | 19.97 | 19.93 | 19.96 | 19.93 | ||
| 11.88 | 13.55 | 19.48 | 18.81 | 19.14 | 19.00 | ||
| 15.80 | 19.99 | 19.98 | 19.93 | 19.97 | 19.95 |
The average accuracy (%) for the simulation data sets obtained by , , SCAD, MCP, and log-sum. In bold, the best performance is shown.
| Sample Size | SCAD | MCP | Log-Sum | ||||
|---|---|---|---|---|---|---|---|
| 3.20% | 15.55% | 80.52% | |||||
| 3.25% | 58.51% | ||||||
| 3.12% | 14.44% | 98.03% | 74.58% | 93.34% | |||
| 3.19% | 20.50% | 48.86% | 82.90% | 81.74% | |||
| 3.20% | 15.33% | 71.85% | 75.30% | 90.68% | |||
| 3.26% | 20.87% | 54.87% | 84.57% | 83.93% | |||
| 3.19% | 20.50% | 48.86% | 82.90% | 81.74% | |||
| 3.19% | 20.20% | 49.20% | 83.22% | 81.74% |
Experimental results on the four datasets (the results of our proposed method are emphasized in bold and italic).
| Datasets | Methods | | | | | | |
|---|---|---|---|---|---|---|---|
| | | 0.87 | 0.65 | 0.74 | 0.68 | 0.74 | 0.90 |
| | | 0.87 | 0.64 | 0.75 | 0.67 | 0.74 | 0.90 |
| | SCAD | 0.84 | 0.71 | 0.82 | 0.62 | 0.72 | 0.93 |
| | MCP | 0.85 | 0.68 | 0.80 | 0.65 | 0.73 | 0.91 |
| | | 0.82 | 0.75 | 0.81 | 0.62 | 0.72 | 0.92 |
| | | 0.81 | 0.74 | 0.70 | 0.70 | 0.64 | 1.23 |
| | | 0.82 | 0.73 | 0.73 | 0.68 | 0.63 | 1.25 |
| | SCAD | 0.86 | 0.63 | 0.74 | 0.69 | 0.70 | 1.12 |
| | MCP | 0.83 | 0.70 | 0.74 | 0.69 | 0.65 | 1.21 |
| | | 0.87 | 0.62 | 0.75 | 0.65 | 0.64 | 1.24 |
| | | 0.87 | 0.28 | 0.73 | 0.30 | 0.60 | 0.52 |
| | | 0.88 | 0.28 | 0.74 | 0.30 | 0.60 | 0.52 |
| | SCAD | 0.86 | 0.30 | 0.77 | 0.30 | 0.62 | 0.51 |
| | MCP | 0.88 | 0.27 | 0.83 | 0.29 | 0.64 | 0.50 |
| | | 0.86 | 0.29 | 0.84 | 0.26 | 0.64 | 0.50 |
| | | 0.75 | 0.57 | 0.51 | 0.53 | 0.61 | 0.67 |
| | | 0.74 | 0.58 | 0.58 | 0.51 | 0.61 | 0.67 |
| | SCAD | 0.72 | 0.59 | 0.73 | 0.45 | 0.59 | 0.69 |
| | MCP | 0.74 | 0.57 | 0.73 | 0.46 | 0.58 | 0.70 |
| | | 0.73 | 0.60 | 0.68 | 0.48 | 0.57 | 0.70 |
Figure 4. The value of residual () on different datasets.
Figure 5. The number of descriptors obtained by multiple linear regression with the different penalties on different datasets (different colors represent different datasets).
The 9 top-ranked descriptors identified by , , SCAD, MCP, and log-sum from the GHLI dataset (the common descriptors are emphasized in bold).
| Rank | GHLI | |||||
|---|---|---|---|---|---|---|
| SCAD | MCP | Log-Sum | ||||
| Mp | minsCl | |||||
| MDEC-44 | ATSC1e | |||||
| minaaN | ||||||
| AATS0e | WPOL | |||||
| GGI9 | meanI | |||||
| ALogP | ||||||
| nFG12Ring | ||||||
| ETA_Epsilon_3 | AATS6i | AATS0v | ||||
| minHCsatu | SIC1 | ATS4v | AATSC8m | ATS4p | ||
The 10 top-ranked descriptors identified by , , SCAD, MCP, and log-sum from the EDCER dataset (the common descriptors are emphasized in bold).
| Rank | EDCER | |||||
|---|---|---|---|---|---|---|
| SCAD | MCP | Log-Sum | ||||
| MATS1i | GATS1c | MATS1c | ||||
| piPC6 | ||||||
| nBase | nHBint2 | nTG12Ring | ||||
| GATS8p | nHBd | |||||
| SHBint2 | ||||||
| MATS5v | C3SP2 | ETA_Beta_ns_d | TIC1 | |||
| SpMin5_Bhs | nAcid | SHBint8 | MDEC-24 | AATSC8m | ||
The 8 top-ranked descriptors identified by , , SCAD, MCP, and log-sum from the BATZD dataset (the common descriptors are emphasized in bold).
| Rank | BATZD | |||||
|---|---|---|---|---|---|---|
| SCAD | MCP | Log-Sum | ||||
| JGI3 | MATS5m | GATS1p | GATS1v | |||
| SdS | C4SP3 | GATS3m | GATS8c | |||
| naaS | ||||||
| mindS | minddssS | ALogP | AATSC4i | |||
| GATS4m | nHsOH | |||||
| maxdS | ETA_Epsilon_4 | SpDiam_Dzp | ||||
The 6 top-ranked descriptors identified by , , SCAD, MCP, and log-sum from the BCL2 dataset (the common descriptors are emphasized in bold).
| Rank | BCL2 | |||||
|---|---|---|---|---|---|---|
| SCAD | MCP | Log-Sum | ||||
| AATSC4s | ||||||
| VE2_D | ||||||
| GATS4s | ||||||
| minHsNH2 | ||||||
| GATS8p | ||||||
| SpMax1_Bhi | nT8Ring | SwHBa | ||||
The detailed information of the descriptors obtained by the log-sum method.
| Descriptor Type | Class | Descriptor |
|---|---|---|
| Autocorrelation | 2D | AATS0v; AATSC4i; AATSC8m; ATS4p; ATSC1p; ATSC4c; ATSC7s; GATS1e; GATS1v; GATS3s; GATS8c; MATS1c; MATS8m; AATSC8p; GATS4s |
| Atom-type electrotopological state | 2D | hmax; LipoaffinityIndex; maxaaCH; maxHBa; maxwHBa; naaS; nssO; SHBint2; maxHBint2; minsOH; SwHBa |
| Barysz matrix | 2D | SpDiam_Dzp |
| Burden modified eigenvalues | 2D | SpMax1_Bhi |
| Information content | 2D | TIC1 |
| Path counts | 2D | piPC6 |
| Ring count | 2D | nFG12HeteroRing |
| Topological charge | 2D | JGI10 |
| Information content | 2D | IC2 |
The name of the descriptors obtained by the log-sum method.
| Descriptor | Name |
|---|---|
| AATS0v | Average Broto–Moreau autocorrelation-lag 0/weighted by van der Waals volumes |
| AATSC4i | Average centered Broto–Moreau autocorrelation-lag 4/weighted by first ionization potential |
| AATSC8m | Average centered Broto–Moreau autocorrelation-lag 8/weighted by mass |
| ATS4p | Broto–Moreau autocorrelation-lag 4/weighted by polarizabilities |
| ATSC1p | Centered Broto–Moreau autocorrelation-lag 1/weighted by polarizabilities |
| ATSC4c | Centered Broto–Moreau autocorrelation-lag 4/weighted by charges |
| ATSC7s | Centered Broto–Moreau autocorrelation-lag 7/weighted by I-state |
| GATS1e | Geary autocorrelation-lag 1/weighted by Sanderson electronegativities |
| GATS1v | Geary autocorrelation-lag 1/weighted by van der Waals volumes |
| GATS3s | Geary autocorrelation-lag 3/weighted by I-state |
| GATS8c | Geary autocorrelation-lag 8/weighted by charges |
| hmax | Maximum H E-state |
| JGI10 | Mean topological charge index of order 10 |
| LipoaffinityIndex | Lipoaffinity index |
| MATS1c | Moran autocorrelation-lag 1/weighted by charges |
| MATS8m | Moran autocorrelation-lag 8/weighted by mass |
| maxaaCH | Maximum atom-type E-state: :CH: |
| maxHBa | Maximum E-states for (strong) hydrogen bond acceptors |
| maxwHBa | Maximum E-states for weak hydrogen bond acceptors |
| naaS | Count of atom-type E-state: :S: |
| nFG12HeteroRing | Number of >12-membered fused rings containing heteroatoms (N, O, P, S or halogens) |
| nssO | Count of atom-type E-state: -O- |
| piPC6 | Conventional bond order ID number of order 6 (ln(1 + x)) |
| SHBint2 | Sum of E-state descriptors of strength for potential hydrogen bonds of path length 2 |
| SpDiam_Dzp | Spectral diameter from Barysz matrix/weighted by polarizabilities |
| SpMax1_Bhi | Largest absolute eigenvalue of Burden-modified matrix - n 1/weighted by the relative first ionization potential |
| TIC1 | Total information content index (neighborhood symmetry of 1-order) |
| SwHBa | Sum of E-states for weak hydrogen bond acceptors |
| AATSC8p | Average centered Broto–Moreau autocorrelation-lag 8/weighted by polarizabilities |
| IC2 | Information content index (neighborhood symmetry of 2-order) |
| GATS4s | Geary autocorrelation-lag 4/weighted by I-state |
| maxHBint2 | Maximum E-state descriptors of strength for potential hydrogen bonds of path length 2 |
| minsOH | Minimum atom-type E-state: -OH |
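Many of the descriptors above are topological autocorrelations, so a small worked example may help in reading their names. The sketch below computes Broto–Moreau (ATS), average (AATS), centered (ATSC), average centered (AATSC), Moran (MATS), and Geary (GATS) values at a given lag, where lag means the number of bonds on the shortest path between two atoms. It assumes the classical statistical definitions with each unordered atom pair counted once; descriptor software may normalize differently. The four-atom chain and its atomic weights are arbitrary illustrative numbers, not real atomic properties.

```python
from collections import deque

def topological_distances(adjacency):
    """All-pairs shortest path lengths (in bonds) via breadth-first search."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist

def autocorrelations(adjacency, weights, lag):
    """Autocorrelation descriptors at one lag; needs at least one pair at that lag."""
    dist = topological_distances(adjacency)
    n = len(weights)
    # Unordered pairs (i <= j) at the requested lag; lag 0 pairs each atom with itself.
    pairs = [(i, j) for i in range(n) for j in range(i, n) if dist[i][j] == lag]
    mean = sum(weights) / n
    centered = [w - mean for w in weights]
    sum_sq = sum(c * c for c in centered)

    ats = sum(weights[i] * weights[j] for i, j in pairs)
    atsc = sum(centered[i] * centered[j] for i, j in pairs)
    sq_diff = sum((weights[i] - weights[j]) ** 2 for i, j in pairs)
    return {
        "ATS": ats,
        "AATS": ats / len(pairs),
        "ATSC": atsc,
        "AATSC": atsc / len(pairs),
        "MATS": (atsc / len(pairs)) / (sum_sq / n),
        "GATS": (sq_diff / (2 * len(pairs))) / (sum_sq / (n - 1)),
    }

# Hypothetical linear chain of four atoms: 0-1-2-3.
chain = [[1], [0, 2], [1, 3], [2]]
weights = [1.0, 2.0, 3.0, 4.0]
lag1 = autocorrelations(chain, weights, lag=1)
print(lag1)
```

For this chain at lag 1 the three bonded pairs give ATS = 1·2 + 2·3 + 3·4 = 20 and ATSC = 1.25, so neighboring atoms carry similar weights, which shows up as a positive Moran value (1/3) and a Geary value below 1 (0.3).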