Samina Kausar1,2, Andre O Falcao3,4.
Abstract
BACKGROUND: Molecular space visualization can help to explore the diversity of large heterogeneous chemical data sets, which may ultimately improve the understanding of structure-activity relationships (SAR) in drug discovery projects. Visual SAR analysis can therefore be useful for library design, for classifying chemicals ahead of biological evaluation, and for virtual screening to select compounds for synthesis or in vitro testing. As such, computational approaches for molecular space visualization have become an important topic in cheminformatics research. The proposed approach uses molecular similarity as the sole input for computing a probabilistic surface of molecular activity (PSMA). The similarity matrix is projected into 2D using different dimension reduction algorithms: Principal Coordinates Analysis (PCooA), Kruskal multidimensional scaling, Sammon mapping and t-SNE. On this projection, a kernel density function is applied to compute the probability of activity at each coordinate in the projected space.
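The pipeline described in the abstract (distance matrix, 2D projection, kernel density, probability of activity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses classical PCooA only, an isotropic Gaussian kernel with a hand-picked `bandwidth`, and it assumes the probability of activity is the class-weighted ratio of positive density to total density. The function names are hypothetical.

```python
import numpy as np

def pcooa_2d(dist):
    """Classical principal coordinates analysis: embed a distance matrix in 2D."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering           # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:2]              # two largest eigenvalues
    top_vals = np.clip(eigvals[order], 0.0, None)      # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

def gaussian_kde_2d(points, query, bandwidth):
    """Isotropic Gaussian kernel density estimate evaluated at each query point."""
    diff = query[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2) / (2 * np.pi * bandwidth ** 2)
    return kernel.mean(axis=1)

def activity_probability(coords, labels, query, bandwidth=0.1):
    """Assumed form: P(active | x) = n_pos * f_pos(x) / (n_pos * f_pos(x) + n_neg * f_neg(x))."""
    labels = np.asarray(labels, dtype=bool)
    pos = labels.sum() * gaussian_kde_2d(coords[labels], query, bandwidth)
    neg = (~labels).sum() * gaussian_kde_2d(coords[~labels], query, bandwidth)
    total = pos + neg
    # Where both densities underflow to zero, fall back to the class prior.
    prior = np.full_like(total, labels.mean())
    return np.divide(pos, total, out=prior, where=total > 0)
```

Evaluating `activity_probability` on a regular grid of `query` points would give a surface comparable in spirit to a PSMA; the other projections named in the abstract would replace `pcooa_2d` only.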
Keywords: Molecular/chemical space; Non-metric MDS; Noncontiguous atom matching structural similarity function (NAMS); PCooA; Sammon mapping; Structure activity relationship (SAR); Two dimensional kernel density estimation; t-SNE
Year: 2019 PMID: 33430986 PMCID: PMC6805449 DOI: 10.1186/s13321-019-0386-z
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Overview of the methodology
Fig. 2 Distance functions for similarity-to-distance transformations
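As an illustration of what a similarity-to-distance transformation looks like, the sketch below implements three commonly used choices for a similarity in [0, 1]. Which functions the figure actually compares is not stated here, so the three variants and the `method` names are assumptions made for illustration only.

```python
import numpy as np

def similarity_to_distance(sim, method="linear"):
    """Map similarities in [0, 1] to distances (illustrative variants only)."""
    sim = np.clip(np.asarray(sim, dtype=float), 0.0, 1.0)
    if method == "linear":
        return 1.0 - sim
    if method == "sqrt":
        return np.sqrt(1.0 - sim)
    if method == "log":
        # -log(s) is unbounded as s approaches 0; floor s to keep distances finite.
        return -np.log(np.maximum(sim, 1e-12))
    raise ValueError(f"unknown method: {method}")
```

All three map a similarity of 1 to a distance of 0; they differ in how strongly they stretch the low-similarity range, which in turn changes how a 2D projection spreads dissimilar compounds.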
Data set description: data set sizes broken down by positives and negatives within the training/test split
| Target protein name | UniProt ID | Training set positives | Training set negatives | Test set positives | Test set negatives | Mean distance | Distance std. dev. |
|---|---|---|---|---|---|---|---|
| Sigma non-opioid intracellular receptor 1 (SIGMAR1) | Q99720 | 46 | 135 | 10 | 35 | 0.79 | 0.13 |
| Histamine H1 receptor (HRH1) | P35367 | 184 | 783 | 46 | 195 | 0.83 | 0.08 |
| Potassium voltage-gated channel subfamily H member 2 (HERG) | Q12809 | 39 | 1142 | 12 | 283 | 0.84 | 0.06 |
| D(1B) dopamine receptor (DRD5) | P21918 | 41 | 231 | 5 | 62 | 0.80 | 0.10 |
Includes the average computational distance between compounds of each data set and its respective standard deviation
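The two distance columns summarize the pairwise distances within each data set. A minimal sketch of such a summary is given below, assuming a symmetric distance matrix with a zero diagonal and counting each unordered pair once. Whether the reported values use the population or the sample standard deviation is not stated, so the population form used here is an assumption, as is the function name.

```python
import numpy as np

def distance_summary(dist):
    """Mean and standard deviation of the unique pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    upper = np.triu_indices_from(dist, k=1)   # each unordered pair once, no diagonal
    pairs = dist[upper]
    return pairs.mean(), pairs.std()          # population std (ddof=0), an assumption
```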
Fig. 3 Test set projection over map surface (PSMA) with PCooA. Surfaces represent higher-probability locations. Red circles are ground-truth positives, white circles are ground-truth negatives
Results on validation set
| Target protein name | PCooA AUC | PCooA MCC | KMDS AUC | KMDS MCC | SM AUC | SM MCC | t-SNE AUC | t-SNE MCC |
|---|---|---|---|---|---|---|---|---|
| Sigma non-opioid intracellular receptor 1 (Sigma1R) | 0.87(*) | 0.63 | 0.80 | 0.60 | 0.79 | 0.55 | 0.79 | 0.47 |
| Histamine H1 receptor (HRH1) | 0.80 | 0.45 | 0.83 | 0.43 | 0.78 | 0.36 | 0.87(*) | 0.54 |
| Potassium voltage-gated channel subfamily H member 2 (HERG) | 0.80 | 0.18 | 0.77 | 0.24 | 0.80 | 0.25 | 0.89(*) | 0.33 |
| D(1B) dopamine receptor (DRD5) | 0.98(*) | 0.77 | 0.86 | 0.32 | 0.80 | 0.42 | 0.90 | 0.41 |
| Overall performance (average score) | 0.86 | 0.51 | 0.82 | 0.40 | 0.79 | 0.40 | 0.86 | 0.44 |
PCooA: Principal Coordinates Analysis; KMDS: Kruskal multidimensional scaling; SM: Sammon mapping; t-SNE: t-distributed stochastic neighbor embedding
(*) best model
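The two metrics in the table, area under the ROC curve (AUC) and the Matthews correlation coefficient (MCC), can be computed as sketched below. AUC is obtained here from its pairwise-ranking interpretation, and MCC from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names are hypothetical, and the probability threshold used to binarize predictions for MCC is not given in the table, so the 0.5 cut in the usage lines is only an assumption.

```python
import numpy as np

def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from binary labels and binary predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = float(np.sum(y_true & y_pred))
    tn = float(np.sum(~y_true & ~y_pred))
    fp = float(np.sum(~y_true & y_pred))
    fn = float(np.sum(y_true & ~y_pred))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Usage on toy data: scores are predicted probabilities of activity.
y = [1, 1, 0, 0]
p = np.array([0.9, 0.4, 0.6, 0.1])
toy_auc = auc(y, p)
toy_mcc = mcc(y, p >= 0.5)   # 0.5 threshold is an assumption
```

Because MCC depends on the chosen threshold while AUC does not, a model can rank compounds well (high AUC) and still score a low MCC on a heavily imbalanced set.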
Fig. 4 Test set projection over map surface of selected PSMAs with highest performance. Surfaces represent higher-probability locations. Red circles are ground-truth positives, white circles are ground-truth negatives
Fig. 5 HERG Shepard plot for PCooA, KMDS, SM and t-SNE
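A Shepard plot compares each original pairwise distance with the corresponding distance in the projected space; points close to the diagonal indicate a faithful projection. The sketch below produces the paired values that such a plot would display. The function name is hypothetical, and the Pearson correlation in the usage lines is just one convenient one-number summary, not a measure taken from the source.

```python
import numpy as np

def shepard_pairs(dist, coords):
    """Return (original, projected) distances for every unordered pair of points."""
    dist = np.asarray(dist, dtype=float)
    coords = np.asarray(coords, dtype=float)
    projected = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    upper = np.triu_indices_from(dist, k=1)
    return dist[upper], projected[upper]

# Usage: a projection that reproduces the input distances exactly.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
original, projected = shepard_pairs(d, pts)
fidelity = np.corrcoef(original, projected)[0, 1]
```

Plotting `original` against `projected` as a scatter plot, one panel per projection method, gives the layout the caption describes.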
Fig. 6 Test set projection over the 2D probability map of selected models with highest performance. Contour lines represent the 2D kernel density distribution of active molecules (positive class), and colours other than green mark higher-probability locations. Red circles are ground-truth positives, white circles are ground-truth negatives. ChEMBL IDs in red text (2D structures within red-lined boxes) are true positives; the others (2D structures within white-lined boxes) are false positives