Ignacio Ponzoni, Víctor Sebastián-Pérez, Carlos Requena-Triguero, Carlos Roca, María J Martínez, Fiorella Cravero, Mónica F Díaz, Juan A Páez, Ramón Gómez Arrayás, Javier Adrio, Nuria E Campillo.
Abstract
Quantitative structure-activity relationship (QSAR) modeling using machine learning techniques constitutes a complex computational problem, in which identifying the most informative molecular descriptors for predicting a specific target property plays a critical role. Two general approaches can be used for this modeling procedure: feature selection and feature learning. In this paper, a comparative performance study of two state-of-the-art methods, one from each approach, is carried out. In particular, regression and classification models for three different target properties are inferred using both methods under different experimental scenarios: two drug-like properties, blood-brain barrier permeability and human intestinal absorption, and enantiomeric excess, a measure of purity used for chiral substances. Beyond the comparison of feature selection and feature learning as competing approaches, the hybridization of these strategies is also evaluated, based on previous results obtained in materials science. The experimental results show no clear winner between the two approaches, because performance depends on the characteristics of the compound databases used for modeling. Nevertheless, in several cases the accuracy of the models improved when both approaches were combined, provided that the molecular descriptor sets produced by feature selection and feature learning contained complementary information.
Year: 2017 PMID: 28546583 PMCID: PMC5445096 DOI: 10.1038/s41598-017-02114-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Scheme of the in silico experiments reported for the prediction of blood-brain-barrier (BBB), human intestinal absorption (HIA) and enantiomeric excess (EE).
Metrics of the best QSAR models for each dataset.
| Dataset | Regression: CC | Regression: % Training Sampling Size | Regression: Mol. D. Subset | Regression: Learning Method | Classification: %CC | Classification: ROC | Classification: % Training Sampling Size | Classification: Mol. D. Subset | Classification: Learning Method |
|---|---|---|---|---|---|---|---|---|---|
| BBB | 0.76 | 66% | D + D | R. Committee | 86.49% | 0.720 | 66% | D + D | N. Networks |
| HIA | 0.75 | 75% | Both | N. Networks | 86.96% | 0.865 | 66% | D + D | R. Forest |
| EE | 0.69 | 75% | C − T | R. Forest | 81.43% | 0.678 | 75% | C − T | R. Forest |
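The CC and %CC columns above can be illustrated with a minimal sketch. It assumes that CC denotes the Pearson correlation coefficient between observed and predicted values and that %CC denotes the percentage of correctly classified compounds; the record does not spell out either definition, and the function names are hypothetical.

```python
import math


def correlation_coefficient(observed, predicted):
    """Pearson correlation between observed and predicted values.

    Assumption: this is what the CC column reports.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    # Covariance numerator and the two standard-deviation numerators.
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov / math.sqrt(ss_obs * ss_pred)


def percent_correctly_classified(true_labels, predicted_labels):
    """Share of matching labels, as a percentage.

    Assumption: this is what the %CC column reports.
    """
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * hits / len(true_labels)
```

Both functions expect two sequences of equal length; `correlation_coefficient` is undefined (division by zero) when either sequence is constant.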
Figure 2. Physicochemical and structural representation of the BBB dataset. (A) Dispersion of the compounds with respect to rotatable bonds and hydrogen bond donors. Color is defined by stars, a parameter related to the physicochemical properties of commercially available drugs. (B) Dispersion of the dataset with respect to molecular weight and the theoretical value of blood-brain barrier permeability. Color is defined by stars. (C) Structural diversity represented as a 3D dispersion of the compounds with respect to CODES descriptors.
Molecular descriptor subsets used for inferring the blood-brain-barrier QSAR models.
| Subset Name | Method | Size | Molecular descriptor names |
|---|---|---|---|
| CTBBB | CODES-TSAR | 3 | CODES-T1, CODES-T2, CODES-T3 |
| M2BBB | D/DELPHOS | 3 | nR06, SIC1, CIC5 |
| M13BBB | D/DELPHOS | 7 | AMW, RBN, MATS5e, MATS4p, EEig12d, JGI7, Hy |
| M2BBB ∪ CTBBB | Combined | 6 | nR06, SIC1, CIC5, CODES-T1, CODES-T2, CODES-T3 |
| M13BBB ∪ CTBBB | Combined | 10 | AMW, RBN, MATS5e, MATS4p, EEig12d, JGI7, Hy, CODES-T1, CODES-T2, CODES-T3 |
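The "Combined" rows are plain unions of a D/DELPHOS subset with the CODES-TSAR subset. A minimal sketch of that operation, using the descriptor names from the table (the helper `combine_subsets` and the variable names are hypothetical, not part of either tool):

```python
def combine_subsets(*subsets):
    """Order-preserving union of descriptor-name lists."""
    combined = []
    for subset in subsets:
        for name in subset:
            # Skip descriptors already contributed by an earlier subset.
            if name not in combined:
                combined.append(name)
    return combined


# Descriptor names copied from the table above.
CT_BBB = ["CODES-T1", "CODES-T2", "CODES-T3"]
M2_BBB = ["nR06", "SIC1", "CIC5"]
M13_BBB = ["AMW", "RBN", "MATS5e", "MATS4p", "EEig12d", "JGI7", "Hy"]

M2_CT_BBB = combine_subsets(M2_BBB, CT_BBB)    # 6 descriptors
M13_CT_BBB = combine_subsets(M13_BBB, CT_BBB)  # 10 descriptors
```

Because the two methods produce disjoint descriptor names here, the combined sizes are simply the sums of the individual sizes, which matches the Size column.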
Figure 3. Redundancy analysis among the molecular descriptors that make up the M13BBB model.
Figure 4. Physicochemical and structural representation of the HIA dataset. (A) Dispersion of the compounds with respect to hydrogen bond donors and rotatable bonds. Color is defined by stars, a parameter related to the physicochemical properties of commercially available drugs. (B) Dispersion of the dataset with respect to logP values and polar surface area (PSA). Color is defined by HIA experimental values. (C) Structural diversity represented as a 3D dispersion of the compounds using CODES descriptors.
Molecular descriptor subsets used for inferring the human intestinal absorption QSAR models.
| Subset Name | Method | Size | Molecular descriptor names |
|---|---|---|---|
| CTHIA | CODES-TSAR | 3 | CODES-T1, CODES-T2, CODES-T3 |
| M5HIA | D/DELPHOS | 4 | AMW, MATS7m, ESpm01d, TPSA(NO) |
| M9HIA | D/DELPHOS | 5 | AMW, GATS6v, JGI4, VRp2, TPSA(NO) |
| M5HIA ∪ CTHIA | Combined | 7 | AMW, MATS7m, ESpm01d, TPSA(NO), CODES-T1, CODES-T2, CODES-T3 |
| M9HIA ∪ CTHIA | Combined | 8 | AMW, GATS6v, JGI4, VRp2, TPSA(NO), CODES-T1, CODES-T2, CODES-T3 |
Figure 5. Mutual information analysis among the molecular descriptors that make up the M5HIA ∪ CTHIA model.
Figure 6. Correlation analysis among the molecular descriptors that make up the M9HIA model.
Figure 7. Accuracy frequency values for QSAR regression models computed using random combined subsets of descriptors.
Figure 8. Physicochemical representation of the EE dataset. (A) Dispersion of the compounds with respect to hydrogen bond donors and rotatable bonds (RB). Color is defined by molecular weight (MW). (B) Dispersion of the dataset with respect to logP values and molecular weight.
Molecular descriptor subsets used for inferring the QSAR models for enantiomeric excess.
| Subset Name | Method | Size | Molecular descriptor names |
|---|---|---|---|
| CTEE | CODES-TSAR | 10 | Sa, Sb, Sc, Sd, La, Lb, Lc, Ld, Le, Lf |
| M9EE | D/DELPHOS | 4 | AMW Sust, EEig11d Sust, JGI6 Sust, TPSA.NO. Lig |
| M14EE | D/DELPHOS | 17 | AMW Sust, PJI2 Sust, EEig12x Sust, EEig09d Sust, EEig11d Sust, GGI8 Sust, nDB Lig, nH Lig, nR09 Lig, TI2 Lig, PW5 Lig, D.Dr08 Lig, AAC Lig, MATS5v Lig, MATS8v Lig, MATS3p Lig, GATS1e Lig |
| M9EE ∪ CTEE | Combined | 14 | AMW Sust, EEig11d Sust, JGI6 Sust, TPSA.NO. Lig, Sa, Sb, Sc, Sd, La, Lb, Lc, Ld, Le, Lf |
| M14EE ∪ CTEE | Combined | 27 | AMW Sust, PJI2 Sust, EEig12x Sust, EEig09d Sust, EEig11d Sust, GGI8 Sust, nDB Lig, nH Lig, nR09 Lig, TI2 Lig, PW5 Lig, D.Dr08 Lig, AAC Lig, MATS5v Lig, MATS8v Lig, MATS3p Lig, GATS1e Lig, Sa, Sb, Sc, Sd, La, Lb, Lc, Ld, Le, Lf |
Figure 9. Mutual information among the substrate and ligand descriptors computed by CODES-TSAR (CTEE subset).
Figure 10. Relationship among the molecular descriptors computed by CODES-TSAR (CT subset) and the target property.
Figure 11. Number of experimental scenarios where QSAR models obtained by combined subsets improve the performance of the QSAR models inferred by individual subsets.
Discretization criteria for target properties.
| Property | Tag | Thresholds |
|---|---|---|
| HIA | Not Absorb | < 0.7 |
| HIA | Absorb | >= 0.7 |
| BBB | BBB+ | <= −0.7 |
| BBB | Gray area | > −0.7 and <= −0.3 |
| BBB | BBB− | > −0.3 |
| EE | Low-enantiopurity | from 10% to 90% |
| EE | High-enantiopurity | from 0% to 10%; from 90% to 100% |
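The discretization criteria above can be sketched as three small functions. This follows the thresholds exactly as printed in the table; the function names are hypothetical, the units of the HIA and BBB inputs are not stated in this record, and the handling of the EE boundary values (exactly 10% and 90%, assigned here to the low class) is an assumption because the table's ranges overlap at those points.

```python
def discretize_hia(value):
    """HIA: 'Absorb' when value >= 0.7, otherwise 'Not Absorb'."""
    return "Absorb" if value >= 0.7 else "Not Absorb"


def discretize_bbb(value):
    """BBB: three classes, following the table as printed.

    An ASCII hyphen stands in for the minus sign of the last tag.
    """
    if value <= -0.7:
        return "BBB+"
    if value <= -0.3:  # i.e. > -0.7 and <= -0.3
        return "Gray area"
    return "BBB-"


def discretize_ee(ee_percent):
    """EE given in percent (0 to 100).

    Assumption: the boundary values 10 and 90 fall in the low class.
    """
    if 10 <= ee_percent <= 90:
        return "Low-enantiopurity"
    return "High-enantiopurity"
```

For example, `discretize_bbb(-0.5)` falls in the gray area, and `discretize_ee(95)` is tagged as high enantiopurity.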
Figure 12. Values and functions assigned by CODES.