Samina Kausar, Andre O Falcao.
Abstract
The performance of quantitative structure-activity relationship (QSAR) models depends largely on the relevance of the molecular representation used as the input data matrix. This work presents a comparative analysis of two main categories of molecular representation (vector space and metric space) for fitting robust machine learning models in QSAR problems. Seven molecular representations, comprising RDKit descriptors, five fingerprint types (MACCS, PubChem, FP2-based, Atom Pair, and ECFP4), and a graph matching approach (non-contiguous atom matching structure similarity; NAMS), were assessed in both vector space and metric space using state-of-the-art machine learning methods combined with different dimensionality reduction methods (feature selection and linear dimensionality reduction). Five distinct QSAR data sets were used for direct assessment and analysis. Results show that, in general, metric-space and vector-space representations produce equivalent models, but there are significant differences between individual approaches. The NAMS-based similarity approach consistently outperformed most fingerprint representations in model quality, closely followed by Atom Pair fingerprints. To verify these findings, the metric space-based models were refitted to the same data sets with the closest neighbors removed, and the results strengthened the above conclusions. The metric space graph-based approach appeared significantly superior to the other representations, albeit at a significant computational cost.
Keywords: PCA; QSAR modeling; feature selection; metric space; non-contiguous atom matching structure similarity—NAMS; random forest; support vector machines; vector space
Year: 2019 PMID: 31052325 PMCID: PMC6539555 DOI: 10.3390/molecules24091698
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Vector space vs. metric space.
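The distinction between the two representations can be illustrated with a minimal sketch. It assumes toy fingerprints given as sets of on-bit indices and uses Tanimoto similarity as the comparison measure; both are assumptions for illustration, and the function names are hypothetical rather than taken from the paper. In vector space each molecule is a row of features; in metric space each molecule is described by its similarities to the other molecules.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(a | b)


def to_vector(bits, n_bits):
    """Vector-space row: one 0/1 feature per fingerprint bit."""
    return [1 if i in bits else 0 for i in range(n_bits)]


def similarity_matrix(fps):
    """Metric-space matrix: entry [i][j] is the similarity of molecules i and j."""
    return [[tanimoto(a, b) for b in fps] for a in fps]


# Toy fingerprints (hypothetical data, 8 bits each).
fingerprints = [{0, 2, 5}, {0, 2, 6}, {1, 3, 4, 7}]

X_vector = [to_vector(fp, 8) for fp in fingerprints]  # shape: molecules x bits
X_metric = similarity_matrix(fingerprints)            # shape: molecules x molecules
```

Either matrix can then be passed to a learner; the metric-space matrix grows with the number of molecules rather than the number of features, which is why a dimensionality reduction step is typically applied to it.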
Figure 2. Quantitative structure–activity relationship (QSAR) modeling methods.
Data set description.
| UniProt ID | Gene Name | Target Protein Name | Associated Bioactivities (Y) | Total Number of Observations (N-Processed) |
|---|---|---|---|---|
| P35367 | HRH1 | Histamine H1 receptor | Ki | 1222 |
| Q99720 | SIGMAR1 | Sigma non-opioid intracellular receptor 1 | Ki | 226 |
| Q12809 | HERG | Potassium voltage-gated channel subfamily H member 2 | Ki | 1481 |
| P35462 | DRD3 | D(3) dopamine receptor | Ki | 2902 |
| P28223 | HTR2A | 5-hydroxytryptamine receptor 2A | Ki | 2088 |
Figure 3. Comparisons of QSAR models’ predictive performance using independent validation sets (IVSs). PVE: percentage of variance explained by the model.
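The record does not give the formula for PVE. A minimal sketch, assuming the common definition of explained variance (one minus the ratio of residual to total sum of squares, scaled to a percentage), could look like this; the function name `pve` is hypothetical.

```python
def pve(y_true, y_pred):
    """Percentage of variance explained, assuming 100 * (1 - SS_res / SS_tot)."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variance around the mean of the observations.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 100.0 * (1.0 - ss_res / ss_tot)
```

Under this assumed definition, a perfect model scores 100 and a model that always predicts the mean of the observed values scores 0.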
Figure 4. Friedman test results and interquartile ranges of tested models.
Figure 5. (A) Boxplots of the three modeling approaches grouped by the different data sets; (B) groups and interquartile ranges of the medians of tested models from the Friedman test post hoc analysis.
Figure 6. Overall performance of similarity representation using PCA on metric space-based QSAR modeling approach.
Data size before and after removing nearest neighbors. Thr—similarity threshold; N—new data set size.
| Target Protein Name | Data Size without Removing Nearest Neighbors | NAMS Thr | NAMS N | ECFP6 Thr | ECFP6 N | RDKit Thr | RDKit N | Atom Pair Thr | Atom Pair N | MACCS Thr | MACCS N | PubChem Thr | PubChem N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Histamine H1 receptor (HRH1) | 1222 | 0.80 | 379 | 0.55 | 378 | 0.80 | 371 | 0.67 | 376 | 0.84 | 379 | 0.87 | 391 |
| Sigma non-opioid intracellular receptor 1 (Sigma1R) | 226 | 0.87 | 312 | 0.61 | 310 | 0.89 | 305 | 0.75 | 309 | 0.92 | 311 | 0.94 | 321 |
| Potassium voltage-gated channel subfamily H member 2 (HERG) | 1481 | 0.80 | 397 | 0.54 | 394 | 0.82 | 392 | 0.69 | 395 | 0.83 | 395 | 0.86 | 403 |
| D(3) dopamine receptor (DRD3) | 2902 | 0.80 | 478 | 0.52 | 481 | 0.77 | 470 | 0.67 | 480 | 0.87 | 484 | 0.86 | 484 |
| 5-hydroxytryptamine receptor 2A (HTR2A) | 2088 | 0.80 | 432 | 0.47 | 432 | 0.78 | 424 | 0.63 | 426 | 0.83 | 429 | 0.85 | 437 |
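The record does not describe how nearest neighbors were removed. One plausible reading, sketched below purely as an assumption, is that training molecules whose similarity to any validation molecule reaches the threshold Thr are dropped before fitting. The helper names, the toy fingerprints, and the use of Tanimoto similarity are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(a | b)


def remove_near_neighbors(train_fps, valid_fps, thr):
    """Return indices of training molecules whose similarity to every
    validation molecule is strictly below thr (assumed filtering rule)."""
    kept = []
    for i, fp in enumerate(train_fps):
        if all(tanimoto(fp, v) < thr for v in valid_fps):
            kept.append(i)
    return kept


# Toy data (hypothetical): three training molecules, one validation molecule.
train = [{0, 2, 5}, {0, 2, 7}, {1, 3, 4, 7}]
valid = [{0, 2, 5, 6}]

kept = remove_near_neighbors(train, valid, thr=0.7)
```

Here the first training molecule has similarity 0.75 to the validation molecule and is dropped, while the other two (0.4 and 0.0) are kept. A lower threshold removes more of the training set, which makes the prediction task harder for similarity-driven models.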
Figure 7. Overall performance of metric space representation after removing nearest neighbors in a PCA on metric space-based QSAR modeling approach.