
Visualization of Topological Pharmacophore Space with Graph Edit Distance.

Abstract

A topological pharmacophore (TP) is a chemical graph-based pharmacophore representation, where nodes are pharmacophoric features (PF) and edges are topological distances between PFs. Previously proposed sparse pharmacophore graphs (SPhGs) for TPs were shown to be effective in identifying structurally different active compounds while maintaining the interpretability of the graphs. However, one limitation of using SPhGs as queries is that many structurally similar SPhGs can be identified from a set of active compounds, requiring the classification and visualization of SPhGs, followed by an understanding of the pharmacophore hypotheses. In this study, we propose a scheme for SPhG analysis based on dimensionality reduction techniques with the graph edit distance (GED) metric. This metric enables measuring similarities among SPhGs in a quantitative manner. The visualization of SPhGs, which themselves are the graphs shared by active compounds, can help us understand the pharmacophore hypotheses as well as the data set. As a proof-of-concept study, we generated two-dimensional SPhG-maps using three dimensionality reduction techniques for six biological targets. A comparison with other pharmacophore representations was also conducted. We demonstrated knowledge extraction (interpretation of the data set) from the generated maps. Our findings include a suitable mapping algorithm as well as a pharmacophore hypothesis analysis procedure using an SPhG-map.

Entities: Chemical

Year: 2022 PMID: 35559135 PMCID: PMC9088954 DOI: 10.1021/acsomega.2c00173

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

A ligand-based pharmacophore is the geometrical arrangement of chemical features in a ligand molecule responsible for molecular interactions against the target macromolecule.[1] Pharmacophoric features (PFs) consist of single atoms or sets of atoms based on interaction types, such as hydrogen bonding and lipophilic interactions. Because a pharmacophore can be regarded as an interaction hypothesis, it can be used as a query for screening chemical libraries.[2−7] Topological pharmacophore models, coined by Schneider et al., employ chemical graph paths as the distance function among PFs.[8,9] On a chemical graph, nodes and edges correspond to atoms and covalent bonds, respectively.[8,9] The distance between PFs is simply the number of covalent bonds on the path, ignoring bond lengths and types. The distances were proposed as “separation” by Smith et al.[10] In contrast to geometry-based pharmacophores, for which plausible conformations are necessary,[11−13] topological pharmacophores are rigorous and at the same time provide less information on three-dimensional molecular coordinates. They are also less computationally demanding and can be applied to large-scale data sets. Sophisticated multidimensional molecular descriptors embodying this representation have been successfully utilized for identifying novel hit compounds in prospective virtual screening campaigns.[9,14] Graph representations of the topological pharmacophores, termed pharmacophore graphs (PhGs),[8,9] have been demonstrated to classify a large-scale data set of BCR-ABL tyrosine kinase.[15] A PhG is a complete graph with PFs as nodes and their topological distances on edges. Sharing of PhGs with a number of active compounds becomes the interaction hypothesis against the target macromolecule. 
In a series of retrospective validation studies, we found that the PhGs extracted from a compound set containing many unique scaffolds became useful queries for identifying active compounds structurally different from the training compounds.[16] In other words, the PhGs shared by diverse active compounds can be useful for scaffold hopping (SH).[8,9,17−19] SH requires a method to capture the functional similarity with a focus on the interaction and freedom from scaffold-based or structural similarity.[8,9] One drawback of PhGs is the lack of interpretability. Complete graphs are hard to interpret because all of the nodes (PFs) are connected to the rest of the nodes. Translating PhGs to the corresponding chemical graphs is not straightforward. In this respect, reduced-graph forms with a small number of edges are preferable. In our previous study, a sparse form of PhGs, termed sparse PhGs (SPhGs), was proposed. We also showed that SPhGs had far fewer edges (close to a tree structure) than PhGs, while they were slightly inferior to the PhGs in terms of screening performance.[20] One major difference between SPhGs and other reduced graphs,[21−23] in addition to the extraction algorithms, is that SPhGs are shared graphs found in multiple active compounds, meaning that each SPhG manifests a pharmacophore hypothesis. However, one limitation of a set of SPhGs as pharmacophore hypotheses is that a large number of similar SPhGs exist for a set of active compounds. The classification and visualization of SPhGs are necessary for understanding the data set. A set of PhGs can be visualized as a network where nodes are PhGs and edges are the parent–child relations of PhGs.[15] A child PhG is created by adding a PF to the parent PhG. This type of connection is effective for visualizing the course of PhG development. However, no connection is detected between PhGs with the same number of PFs and slightly different edge distances.
Like compound visualization in the field of chemography,[24] a proper similarity measurement (metric) is necessary for understanding a set of PhGs by visual inspection. In this study, we propose to visualize SPhGs using the graph edit distance (GED) to understand topological pharmacophore relations in a set of active compounds by clustering analysis. GED was previously employed to quantitatively compare reduced graphs for similarity searching.[25] In the previous study, reduced graphs were generated to compare the corresponding compounds: each reduced graph corresponds to a single compound. However, we focus on the visualization of SPhGs, which themselves are common features among active compounds, and visualizing them leads to understanding the relations of the pharmacophore hypotheses. Active compounds against six target macromolecules were analyzed with the proposed methods and the extracted features, and interpretability is discussed. Python scripts for creating SPhGs from a set of compounds and clustering, including outputs for the six targets in this study, are available in the open access repository: github.com/n-hiroshi/sphg2.

Materials and Methods

Compound Data Sets

Active compounds for six biological targets were extracted from the ChEMBL database (version 24).[26] The number of highly potent compounds for each data set is listed in Table 1, along with the abbreviation and the ChEMBL ID. The selected targets were thrombin (Thr.), tyrosine kinase ABL1 (ABL1), κ-opioid receptor (Kop.), PI3-kinase p110-α subunit (PI3), G protein-coupled receptor 44 (GPCR44), and transmembrane protease serine 6 (TPS6), chosen on the basis of protein family types. Compounds with pK values of 6.0 or higher were regarded as highly potent, except for TPS6, whose potency threshold was lowered to 5.0 because of the limited number of eligible compounds (only three if 6.0 was employed). Only ChEMBL records with a confidence score of 9 were processed. When multiple pK values were available for a single compound, the arithmetic mean was calculated to yield its final potency value as long as all of the values fell into the same order of magnitude; otherwise, the compound was discarded. Compounds with molecular weights between 200 and 600 were used for subsequent analyses to reduce the computational burden and to remove compounds with extreme properties. All of the highly potent compounds for the six targets are provided as SMILES with curated pK values in the open access repository github.com/n-hiroshi/sphg2. For TPS6, the highly potent compounds with their pK values are visualized in Figure 1.
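The curation rules above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' script: the function names are hypothetical, and "the same order of magnitude" is interpreted here as a spread of less than one pK unit between the lowest and highest value, which is an assumption.

```python
from statistics import mean


def curate_pk(values, max_spread=1.0):
    """Return the mean pK, or None when the compound should be discarded.

    Assumption: values "fall into the same order of magnitude" when
    max - min is smaller than one pK (log) unit.
    """
    if not values:
        return None
    if max(values) - min(values) >= max_spread:
        return None
    return mean(values)


def is_highly_potent(pk, threshold=6.0):
    """Potency filter; the text uses 6.0, lowered to 5.0 for TPS6."""
    return pk is not None and pk >= threshold


def in_mw_window(mol_weight, low=200.0, high=600.0):
    """Molecular-weight filter (200 to 600) applied before the analyses."""
    return low <= mol_weight <= high
```

For example, a compound measured at pK 6.2 and 6.8 keeps a curated value of 6.5, whereas a compound measured at 6.0 and 7.5 is dropped under this interpretation.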

Table 1

Compound Data Sets

ChEMBL ID	target	code	no. of highly potent CPDs (a)
CHEMBL204	thrombin	Thr.	514
CHEMBL1862	tyrosine kinase ABL1	ABL1	515
CHEMBL237	κ-opioid receptor	Kop.	1425
CHEMBL2498	PI3-kinase p110-α subunit	PI3	812
CHEMBL5701	G protein-coupled receptor 44	GPCR44	686
CHEMBL1795139	transmembrane protease serine 6	TPS6	21

(a) Highly potent CPDs: compounds exhibiting pK values greater than or equal to 6.0, except for TPS6 (5.0).

Figure 1

Highly potent compounds for TPS6.


TP Representations

Three representations for topological pharmacophores were tested: conventional pharmacophore fingerprints (PhFP),[27] molecular sparse pharmacophore graphs (Mol-SPhGs), and sparse pharmacophore graphs (SPhGs).[20] PhFP is a bit vector, and the other representations are graphs. The three representations are illustrated in Figure 2.

Figure 2

Overview of the three TP representations. (a) Sample molecule with PFs. Blue circles represent HBDs, red circles HBAs, and green circles ARs. (b) Example of PhFP. Each box has a value of 0 or 1. The value of 1 in a box means that the corresponding pharmacophoric pattern(s) exist. Distances between PFs are binned into three categories as mentioned in the parenthesis (lower, upper) in (b). (c) Mol-SPhG converted from the molecule depicted in (a). The letters representing each PF are written on the nodes (D: hydrogen bond donor, A: hydrogen bond acceptor, R: aromatic ring, P: positively ionizable). A node with multiple PFs has the corresponding multiple letters. (d) SPhG generated from Mol-SPhG (c).

PFs

A PF is a chemical feature of a ligand molecule that characterizes an interaction between the ligand and the target macromolecule. Such interactions include hydrogen bonding and electrostatic interactions. Consequently, hydrogen bond donor (HBD), hydrogen bond acceptor (HBA), aromatic rings, and positive/negative ionizable groups are frequently used as PFs. In this study, we employed the RDKit implementation described under the file name of “BaseFeatures.fdef” to identify atoms or groups of atoms matching PFs.[27,28]

PhFPs

A PhFP is a set of combinations of PFs with the topological distances among them.[27] Each combination represents a pharmacophore pattern containing a fixed number of PFs and the distances among them, forming a bit in the fingerprint vector (Figure 2b). Each bin covers a range of distances instead of an exact distance, which takes bond length ambiguity into account. To avoid combinatorial explosion and overly sparse bit vectors, the number of PFs is usually limited to 3 and the maximum distance to 8.[27] Similar atom-pair-based fingerprints were proposed by Capecchi et al.[29] In this study, the RDKit function for topological pharmacophores was used with the default parameter values.[27]

Mol-SPhGs

A Mol-SPhG is a reduced graph of a chemical graph. Nodes of a Mol-SPhG are PF-assigned atoms (termed PF nodes) or junction atoms.[20] The junction atoms are nodes without PFs, which are introduced to keep the original distances among PF nodes. Because a Mol-SPhG holds the topological distance between every pair of PF nodes, no information is lost in terms of TPs (Figure 2c). Details of the construction algorithm of Mol-SPhGs from a chemical graph have been reported by our group.[20]

SPhGs

An SPhG is a sparse representation of a TP in terms of the number of edges and nodes (Figure 2d).[20] This form of pharmacophore strikes a good balance between intuitive understanding of the TP and preservation of the topological distances among PF nodes. The previous study using an active compound data set for thrombin showed that more than 90% of SPhGs kept the topological distances, while a sparse index of 1.02 was achieved on average. The sparse index is defined as

$$\text{sparse index} = \frac{N_E}{N_N - 1}$$

where $N_E$ is the number of edges and $N_N$ is the number of nodes. A tree has exactly $N_N - 1$ edges and therefore a sparse index of 1, so the average value of 1.02 implied that most SPhGs are tree structures. To obtain SPhGs, candidate graphs for SPhGs (candidate SPhGs) are generated from a Mol-SPhG by selecting a predefined number of PF nodes and applying a node reduction algorithm. Our proposed algorithm removes unnecessary nodes and converts aromatic ring features to aromatic bonds. The candidate SPhGs are further filtered based on the number of active compounds or scaffolds containing the candidate SPhGs. The SPhGs passing the filter represent shared TPs among active compounds. In this study, six PF nodes were selected to form candidate SPhGs. For each target, the top 300 SPhGs were selected in terms of the number of Bemis–Murcko scaffolds of the compounds matching the candidate SPhGs (the NScaffolds criterion), identical to the conditions in the previous studies.[15,16,20,30]
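The two quantities used in this subsection can be illustrated with a short Python sketch. It assumes that the sparse index is the number of edges divided by the number of nodes minus one (the form under which a tree scores exactly 1, consistent with the statement that an average of 1.02 implies tree-like graphs), and it represents the NScaffolds criterion as a count of distinct scaffold identifiers per candidate. The function names and the input layout are hypothetical.

```python
def sparse_index(n_edges, n_nodes):
    """Sparse index under the assumed form N_E / (N_N - 1); a tree gives 1.0."""
    if n_nodes < 2:
        raise ValueError("a graph needs at least two nodes")
    return n_edges / (n_nodes - 1)


def top_by_scaffold_count(matches, k=300):
    """Rank candidate SPhGs by the number of distinct scaffolds they match.

    matches: dict mapping a candidate id to an iterable of scaffold
    identifiers (e.g. scaffold SMILES) of the active compounds that
    contain the candidate. Ties are broken by candidate id.
    """
    counts = {cid: len(set(scaffolds)) for cid, scaffolds in matches.items()}
    ranked = sorted(counts, key=lambda cid: (-counts[cid], cid))
    return ranked[:k]
```

A graph with eight nodes and seven edges is a tree and has a sparse index of 1.0; adding one ring-closing edge raises the value to 8/7.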

Distance Metrics for TP Representations

The similarity of pharmacophore graphs was quantitatively measured by the GED. The Jaccard distance was used to measure how (dis)similar a pair of PhFP bit vectors is.
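For reference, the Jaccard distance between two bit vectors is one minus the ratio of shared on-bits to the union of on-bits. The sketch below represents each fingerprint as a collection of on-bit indices; this representation and the convention of returning 0 for two empty fingerprints are illustrative choices, not details taken from the study.

```python
def jaccard_distance(bits_a, bits_b):
    """Jaccard distance between two fingerprints given as on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints are treated as identical (a convention).
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

Two fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share two of four distinct bits, giving a distance of 0.5.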

GED

The GED of graphs A and B is the minimum cost of converting graph A to graph B by editing the nodes and edges of graph A.[31] In other words, the graph similarity is measured by how easily graph A is transformed into graph B. Calculating the GED requires a set of edit operations and a definition of the associated costs. We used six edit operations: node substitution, node insertion, node deletion, edge substitution, edge insertion, and edge deletion. Based on the work by Garcia-Hernandez et al.,[25] the costs of all node and edge operations were newly defined; they are reported in Tables 2 and 3, respectively. According to these tables, the cost of node insertion and deletion is 1, and the cost of changing a node from one PF type to another is 2, which equals the sum of the deletion of the old PF and the insertion of the new PF. Also, the cost of removing one PF (e.g., removing only D) from a node with two PFs (e.g., DP) is the same as the general node deletion cost of 1. The cost of changing a node with two PFs to a new PF(s) is set to 2. In addition to the original definition, our modification of the cost tables for SPhGs and Mol-SPhGs includes three major points. First, the definition for junction nodes, represented as J, is added to the node operation table (Table 2). Every substitution cost involving a junction node is defined as 2 because type J is equally (dis)similar to the other node types. Second, the cost of edge distance operations is newly defined as shown in Table 3. The cost of replacing a nonaromatic edge of length m with a nonaromatic edge of length n (n > m) is defined by eq 2, which is expressed in terms of an index k over the edge lengths [eq 2 is missing from this text]. This monotonically decreasing cost with increasing edge length matches our intuition about molecules. For example, changing an edge with a length of one to an edge with a length of two has a higher impact than changing an edge with a length of six to one with a length of seven. The cost of substituting a nonaromatic edge for an aromatic edge with the same length is defined as 3 times their edge length based on ref (25). In a similar way, the substitution of two aromatic edges with different lengths costs 10 times more than that of the corresponding nonaromatic edges, as a 10 times cost was given for the insertion and deletion of a single, double, or triple bond in ref (25). The robustness of GEDs with respect to the edge cost function was confirmed by testing other functional forms. High distance correlation coefficients of GEDs were observed when using the square root of k or the square of k instead of k in eq 2 (Table S1).

Table 2

Node Edit Costs in GED Calculation

	D	A	P	N	R	J	DA	DP	DN	AP	AN	PN	DAP	DAN
D	0	2	2	2	2	2	1	1	1	2	2	2	1	1
A	2	0	2	2	2	2	1	2	2	1	1	2	1	1
P	2	2	0	2	2	2	2	1	2	1	2	1	1	2
N	2	2	2	0	2	2	2	2	1	2	1	1	2	1
R	2	2	2	2	0	2	2	2	2	2	2	2	2	2
J	2	2	2	2	2	0	2	2	2	2	2	2	2	2
DA	1	1	2	2	2	2	0	2	2	2	2	2	2	2
DP	1	2	1	2	2	2	2	0	2	2	2	2	2	2
DN	1	2	2	1	2	2	2	2	0	2	2	2	2	2
AP	2	1	1	2	2	2	2	2	2	0	2	2	2	2
AN	2	1	2	1	2	2	2	2	2	2	0	2	2	2
PN	2	2	1	1	2	2	2	2	2	2	2	0	2	2
DAP	1	1	1	2	2	2	2	2	2	2	2	2	0	2
DAN	1	1	2	1	2	2	2	2	2	2	2	2	2	0
insertion	1	1	1	1	1	0.5	1	1	1	1	1	1	1	1
deletion	1	1	1	1	1	0.5	1	1	1	1	1	1	1	1

D: hydrogen bond donor.

A: hydrogen bond acceptor.

P: positively ionizable.

N: negatively ionizable.

R: aromatic ring.

J: junction.

Double and triple symbols mean the node to which two or three PFs are assigned.
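The entries of Table 2 follow a simple pattern that can be written as a small function, which is convenient when the costs are passed to a GED routine as callables. The rule below was inferred from the tabulated values (0 for identical labels, 1 when a single D, A, P, or N feature is gained or lost relative to a multi-feature node that contains it, 2 otherwise) and is a sketch rather than the authors' implementation; the function names are hypothetical.

```python
# Single pharmacophoric features that can be combined on one node.
SINGLE_PFS = {"D", "A", "P", "N"}


def node_substitution_cost(label_a, label_b):
    """Substitution cost between two node labels, reproducing Table 2."""
    if label_a == label_b:
        return 0.0
    # Cost 1: one label is a single PF contained in the other multi-PF label,
    # e.g. D versus DA or A versus DAP.
    for single, other in ((label_a, label_b), (label_b, label_a)):
        if single in SINGLE_PFS and len(other) > 1 and single in other:
            return 1.0
    # Everything else, including any change involving R or J, costs 2.
    return 2.0


def node_indel_cost(label):
    """Insertion or deletion cost: 0.5 for a junction node, 1 otherwise."""
    return 0.5 if label == "J" else 1.0
```

With these definitions, `node_substitution_cost("D", "DP")` gives 1, while `node_substitution_cost("DA", "DAP")` and `node_substitution_cost("J", "R")` both give 2, in line with the table.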

Table 3

Edge Edit Costs in GED Calculation

	nonaromatic edge with a length of n (a)	aromatic edge with a length of n (a)
nonaromatic edge with a length of m (a)
aromatic edge with a length of m (a)
insertion	0.1	1.0
deletion	0.1	1.0

(a) Without loss of generality, n ≥ m can be assumed.

When calculating GEDs of Mol-SPhGs, a time limit was imposed using the option implemented in the networkx library.[32] Mol-SPhGs have more PF nodes than SPhGs, which sometimes makes the GED calculation too time-consuming.[18] The time-limit option terminates the search for the minimum graph edit path after a predefined time and outputs the current minimum distance. In this study, a limit of ten seconds was set, resulting in the consumption of 283 h of CPU time for calculating GEDs for the Kop. data set, which contained 1425 compounds (1 016 025 comparisons). For SPhGs, the calculation time was reduced to around 1.5 h of CPU time for 300 SPhGs (45 300 comparisons) owing to the sparseness of SPhGs. It should be noted that the pairwise comparisons can be run in parallel, further reducing the computation time. We used the approximated GED implemented in the networkx library (version 2.5).[32]

Visualization of the TP Space

Distance metrics enable visualizing data sets in terms of pharmacophore representations by means of unsupervised learning techniques. Three dimensionality reduction methods were tested: t-distributed stochastic neighbor embedding (t-SNE),[33] Isomap,[34] and multidimensional scaling (MDS),[35] all of which are implemented in the scikit-learn library (version 0.23.2).[36] These three visualization methods were employed with the three pharmacophore representations (PhFPs, Mol-SPhGs, and SPhGs), generating nine maps for a single biological target. We call these maps PhFP-maps, Mol-SPhG-maps, and SPhG-maps, respectively, irrespective of the mapping algorithm. On a map using the PhFP or Mol-SPhG representation (PhFP-map or Mol-SPhG-map), each dot matches one PhFP vector or one Mol-SPhG and therefore corresponds to the CPD for which the representation was generated. These dots are colored according to pK values. On an SPhG-map, dots correspond only to pharmacophore graphs. The dots are colored according to the coverage, which is defined as the ratio of the number of compounds covered by the SPhG to the total number of compounds in the data set.
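The coverage used to color the dots on an SPhG-map is a simple ratio. A minimal sketch, assuming that the matching compounds of each SPhG are already known and are given as identifiers (the function name and input layout are hypothetical):

```python
def coverage(matched_compound_ids, all_compound_ids):
    """Fraction of the data set's compounds that contain a given SPhG."""
    total = set(all_compound_ids)
    if not total:
        raise ValueError("the data set is empty")
    # Only compounds that belong to the data set are counted.
    return len(set(matched_compound_ids) & total) / len(total)
```

An SPhG found in two compounds of a four-compound data set has a coverage of 0.5.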

Results and Discussion

TP Maps

Visualization of SPhGs gives us an intuitive understanding of the relations among SPhGs, which cannot be achieved by inspecting the chemical space spanned by molecular descriptors, including TP fingerprints. The main difference between these two kinds of maps is that SPhGs are shared features among active compounds, not the compounds themselves. We employed the three dimensionality reduction algorithms Isomap, MDS, and t-SNE to make two-dimensional maps for a set of highly potent compounds represented by PhFP, Mol-SPhG, and SPhG, resulting in nine maps for each target. All of the maps for all six biological targets are reported in Figures S1–S6 in the Supporting Information. t-SNE was selected because it formed clusters and was suitable for the later discussion of pharmacophore space. On most maps created by MDS, no clustered regions were formed at all, and on some maps created by Isomap, dots (compounds or SPhGs) overlapped one another, although this algorithm could form clustered regions. Furthermore, t-SNE mapping showed the best ability to preserve the GEDs between SPhGs, in particular for similar SPhGs. For six out of the seven targets, when measuring the shortest 1% of distances, t-SNE showed the highest correlation coefficients between the GEDs and the Euclidean distances on the maps, ranging from 0.708 to 0.857. The distance correlation coefficients using the thresholds of 1, 3, and 10% are reported in Table S2. Thus, we decided to base the further discussion on the maps generated by the t-SNE algorithm. In the following section, first, the difference between Mol-SPhGs and PhFP as molecular representations is clarified. Then, using SPhG-maps, the TP information that could be extracted for the highly potent CPDs is discussed.
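The local-distance check described above can be sketched as follows: keep the pairs with the smallest GEDs, then correlate those GEDs with the Euclidean distances of the same pairs on the two-dimensional map. The text does not state which correlation coefficient was used; Pearson's r is assumed here, and the function names and input layout are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def local_distance_correlation(ged, coords, fraction=0.01):
    """Correlate GEDs with map distances for the shortest `fraction` of pairs.

    ged: dict mapping a pair (i, j) to its graph edit distance.
    coords: dict mapping an item id to its (x, y) position on the map.
    """
    pairs = sorted(ged, key=ged.get)
    n_keep = max(2, math.ceil(len(pairs) * fraction))
    kept = pairs[:n_keep]
    ged_values = [ged[pair] for pair in kept]
    map_values = [math.dist(coords[i], coords[j]) for i, j in kept]
    return pearson(ged_values, map_values)
```

A value close to 1 for a small fraction indicates that near neighbors in GED space remain near neighbors on the map, which is the property examined with the 1, 3, and 10% thresholds.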

Comparison of Mol-SPhGs with PhFP

Two maps using the PhFP and Mol-SPhG representations for Thr. are reported in Figure 3a,b, respectively. Example compounds in Figure 3a were selected based on k-means clustering. The number of clusters was determined as the smallest number at which the sum of squared errors inside the clusters reached a 90% reduction as the number of clusters increased. In each cluster, the compound with the highest pK value is displayed. Figure 3b shows the Mol-SPhG-map, where each point corresponds to a compound, as in the PhFP-map. CPD1 to CPD6 in Figure 3a are shown as Mol-SPhGs on this map.
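The rule for choosing the number of clusters can be expressed compactly once the within-cluster sum of squared errors (SSE) is available for each candidate number of clusters. The sketch below assumes that the 90% reduction is measured relative to the SSE of a single cluster; the baseline is not stated explicitly in the text, and the function name is hypothetical.

```python
def choose_n_clusters(sse_by_k, reduction=0.90):
    """Smallest k whose SSE is reduced by `reduction` relative to k = 1.

    sse_by_k: dict mapping the number of clusters k to the k-means SSE;
    it must contain the baseline entry for k = 1.
    Returns None when no tested k reaches the requested reduction.
    """
    baseline = sse_by_k[1]
    for k in sorted(sse_by_k):
        if sse_by_k[k] <= (1.0 - reduction) * baseline:
            return k
    return None
```

With SSE values of 100, 60, 25, 9, and 5 for one to five clusters, four clusters is the first setting whose SSE falls below 10% of the baseline.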

Figure 3

Maps for thrombin (Thr.). (a) PhFP-map of Thr. Typical active compounds (CPD) are shown on the clustering map. Each point represents one CPD, and its color is defined by its pK. Six exemplified Mol-SPhGs selected by the k-means method are displayed. (b) Mol-SPhG-map of Thr. Each dot corresponds to a compound. The Mol-SPhGs of compounds CPD1–6 are displayed along with their locations.

On the PhFP-map, there were more clusters than on the Mol-SPhG-map in Figure 3. For example, the cluster to which CPD3 belonged consisted of CPDs with the same Bemis–Murcko scaffold. However, these clusters did not exist on the Mol-SPhG-map. On the Mol-SPhG-map, CPD3 belonged to a single cluster with molecules containing different scaffolds. Molecular scaffold-based clustering could miss the actual (topological) pharmacophore. Indeed, several compounds belonging to different scaffolds were found to interact with thrombin at the same binding site, as supported by X-ray co-crystal structures.[37−42] Three of the example CPDs, CPD2, CPD5, and CPD6, contained amidine substructures. However, CPD1 and CPD3 had no substructures similar to amidines. The chemical structures of CPD2 and CPD5 shared no common scaffolds. However, their Mol-SPhG-based scaffolds were relatively similar to each other (Figure 3b). These Mol-SPhGs contained two Ds (HBDs) connected by a bond of length two and aromatic bonds next to the junction node between the two Ds. Furthermore, negatively ionizable features, carboxy groups, were located on the opposite side of the graphs from the two Ds. This indicated that the mapping using Mol-SPhGs with GED clustered CPDs in a less structurally dependent manner. Similar characteristics were observed for the other targets. For example, for ABL1 inhibitors, the substructure marked by dashed circles on CPD1, CPD2, CPD3, and CPD5 in Figure 4a formed the core of these compounds, yet the compounds were relatively dispersed on the map. On the other hand, in the form of Mol-SPhGs (Figure 4b), these CPDs were located closer to each other. Furthermore, Mol-SPhGs made interpretation easier because the distances reflected how easily (or with what difficulty) one graph can be modified into another. For example, CPD1 and CPD5 were distant on the PhFP-map but not on the Mol-SPhG-map. The Mol-SPhGs of these two CPDs had the same heterocyclic structure consisting of two fused pyridines, and the difference in substituents was measured by GED, resulting in the relatively short distance between these two CPDs on the map in Figure 4b.

Figure 4

Maps for tyrosine kinase ABL1 (ABL1). (a) PhFP-map of ABL1. Typical active compounds (CPD) are shown on the clustering map. Each dot represents one CPD, and its color is defined by its pK. Six exemplified Mol-SPhGs selected by the k-means method are displayed. (b) Mol-SPhG-map of ABL1. Each dot corresponds to a compound. The Mol-SPhGs of compounds CPD1–6 are displayed along with their locations.

SPhG-maps

Figures 3 and 4 show that the TP information captured by Mol-SPhG-maps was less dependent on structural scaffolds. In the following, we further discuss SPhG-maps. It should be noted again that SPhGs are the extracted common subgraphs of Mol-SPhGs. The visualization of SPhGs is conceptually different from visualizing CPDs on Mol-SPhG-maps. For each target, the number of Bemis–Murcko scaffolds found in the compounds containing each of the selected 300 SPhGs was counted. The average number was 35.9 for Thr., 12.9 for ABL1, 29.2 for Kop., 68.7 for PI3, 29.7 for GPCR44, and 2.2 for TPS6. Although the number of scaffolds for TPS6 was small because of the small data set size, the selected SPhGs were indeed common features of active molecules that did not depend on molecular scaffolds. For the SPhG examples found in the following SPhG-maps, the number of scaffolds is reported in Table S3. The number of clusters on the SPhG-maps was determined using the same criterion as that used in Figures 3a and 4a. The SPhG examples shown in the figures exhibited the highest and the second-highest coverages, as indicated in the figure captions.

Thrombin (Thr.)

SPhGs were clustered into two distinct regions in Figure 5. In the top left cluster, SPhG5 and SPhG6 contained the same subgraph with four positively ionizable features (Ps) followed by a long chain without any PFs. The positively ionizable features corresponded to the guanidinium substructure. In the other cluster at the bottom right, there were no Ps in the SPhGs forming the cluster. SPhG1 and SPhG2 had two hydrogen bond donors (Ds in Figure 5), which commonly had a junction node with a distant one. SPhG3 and SPhG4 did not have this subgraph.

Figure 5

SPhG-map for Thr. Each point represents an SPhG, and its color is defined by its coverage. Selected SPhGs are shown on the map. The SPhGs with the first and second-highest coverages in each of the classes categorized by the k-means method are displayed.

The SPhG-map displayed graphs with six PFs and a few additional junction nodes, commonly identified among active CPDs. This led to pharmacophore hypotheses of the ligand–target interaction. For example, in Figure 5, SPhG1 and SPhG2 had a common substructure of two HBDs (Ds) and a junction between them at a distance of 1. Another donor was found at a distance of 6 from the junction, and a pair of D and HBA (A) at a distance of 2 on the opposite side of the two HBDs was typical. These features were also consistent with those explained by the X-ray cocrystallized structures listed in the Protein Data Bank (PDB).[37]

Tyrosine Kinase ABL1 (ABL1)

The SPhG-map for ABL1, along with selected SPhGs colored by coverage, is shown in Figure 6. The SPhG with the highest coverage was SPhG1 (61.7%), meaning that over 60% of the active compounds contained SPhG1. Overall, the SPhGs on the map resembled one another. SPhG1 contained one fused aromatic ring system consisting of two rings, with an HBA on one of the rings. This substructure was also detected in SPhG2 and SPhG3 and was included in the isoquinoline substructure of CPD1, CPD2, and CPD3 in Figure 4. Furthermore, a pattern of an HBA and an HBD separated by a distance of two followed by an aromatic bond was found in SPhG1, SPhG4, and SPhG5. While these features might be detected from the Mol-SPhG-map in Figure 4 by careful inspection, the SPhG-map represented the relations explicitly. The design concept of the ABL1 inhibitors could be interpreted with the help of the SPhG-map.

Figure 6

SPhG-map for tyrosine kinase ABL1 (ABL1). Each point represents an SPhG, and its color is defined by its coverage. Selected SPhGs are shown on the map. The SPhGs with the highest coverages in each of the clusters categorized by the k-means method are displayed.

κ-Opioid Receptors (Kop.)

On the SPhG-map for Kop., shown in Figure 7, SPhG1 and SPhG2 belonged to the same cluster, in which SPhGs contained an aromatic bond feature with a length of two and a branch to an HBA starting at the middle of the bond. SPhG3, in the cluster at the upper right corner, had an aromatic bond feature with a length of three (not two). Although SPhG1 and SPhG3 seemed similar and the GED between them was 3.25 (the 7.5th percentile of the pairwise distances for all of the SPhG pairs in Figure 7), the encoded features (PFs with bonds) were different (Figure 8a). SPhG1 and SPhG3 matched different paths to the same PFs on the same compound. Out of the three SPhGs, only SPhG1 was detected in active compounds with different scaffolds, such as one similar to pentazocine, as shown in Figure 8b. SPhG4 also matched the compound in Figure 8a without introducing aromatic bonds, focusing only on the hydrogen bonds. SPhGs in the small cluster including SPhG5 on the left side of the map matched only the different chemotypes represented by the compounds shown in Figure S7. Note that, as shown in Figure 7, an SPhG could contain a node with two PFs (e.g., DA). This meant that a substructure matching both PFs, such as a hydroxyl group, was necessary. If only one of them had been required for activity, the mined SPhG would have contained a node with only that PF.

Figure 7

SPhG-map for κ-opioid receptors (Kop.). Each point represents an SPhG, and its color is defined by its coverage. Selected SPhGs are shown on the map. The SPhGs with the highest coverages in each cluster categorized by the k-means method are displayed.

Figure 8

Three different SPhGs derived from an active compound for Kop. (a) Active compound analogous to morphine containing three different SPhGs. (b) Active compound containing a different scaffold but SPhG1 as a subgraph.

Transmembrane Protease Serine 6 (TPS6)

The number of active compounds for TPS6 was 21 (Figure 1). Even for this small data set, the SPhG-map could categorize a number of SPhGs into different clusters (Figure 9). Because each SPhG represented a TP hypothesis, extracting common SPhGs followed by the clustering analysis gave insights into the hypotheses, as opposed to the PhFP- and Mol-SPhG-maps in Figure S6. The top right cluster on the map in Figure 9 might correspond to the hypothesis of the guanidinium moiety and other hydrogen bonding features on the opposite side, as exemplified by SPhG2. On the other hand, SPhG1, SPhG4, and SPhG5 in other clusters corresponded to an arrangement of hydrogen bonding features. These three SPhGs had two HBDs (Ds in Figure 9) and a junction node between them. From the junction node, another HBD was placed at a distance of six, followed by two HBAs (As in Figure 9) at a distance of three. These SPhGs were similar to those for Thr., and SPhG5 in Figure 9 was identical to SPhG2 in Figure 5. An experimental study showed that the compounds containing SPhG2 in Figure 5, which is identical to SPhG5 in Figure 9, were active against both Thr. and TPS6.[43] The common SPhG was successfully identified in this study. The SPhGs in the bottom left cluster, represented by SPhG3, were completely different hypotheses and matched CPD5 and CPD16 in Figure 1. These types of SPhGs were not found for Thr. This implied that the CPDs that included SPhG3 and did not include SPhG5 in Figure 9 were expected to be active against TPS6 and not against Thr.

Figure 9

SPhG-map for transmembrane protease serine 6 (TPS6). Each point represents an SPhG, and its color is defined by its coverage. Selected SPhGs are shown on the map. All active CPDs with pK > 5.0 for TPS6 in our data set are displayed. The SPhG with the highest coverage in each cluster categorized by the k-means method is shown.

Conclusions

The visualization of topological pharmacophores (TPs) is important for understanding ligand–target binding hypotheses. In this study, GED was introduced as a metric to evaluate the similarity among SPhGs, which are sparse representations of pharmacophore graphs. Among the three tested dimensionality reduction algorithms, t-SNE was the best based on visual inspection and the local-distance preservation of GEDs. To evaluate the maps and demonstrate their use, we generated SPhG-maps using active compounds against the six biological targets: Thr., ABL1, Kop., PI3, GPCR44, and TPS6. First, we compared the two TP representations PhFP and Mol-SPhG using the corresponding maps and found that Mol-SPhG was less structurally dependent than PhFP. Then, for each target, the top 300 SPhGs identified from a set of active compounds were visualized on an SPhG-map with the GED metric. The classification of SPhGs and TP knowledge extraction were demonstrated using the maps.

32 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Chemography: the art of navigating in chemical space.

Authors: T I Oprea; J Gottfries
Journal: J Comb Chem Date: 2001 Mar-Apr

3. "Scaffold-Hopping" by Topological Pharmacophore Search: A Contribution to Virtual Screening.

Authors:
Journal: Angew Chem Int Ed Engl Date: 1999-10-04 Impact factor: 15.336

4. LigandScout: 3-D pharmacophores derived from protein-bound ligands and their use as virtual screening filters.

Authors: Gerhard Wolber; Thierry Langer
Journal: J Chem Inf Model Date: 2005 Jan-Feb Impact factor: 4.956

5. New allosteric modulators of metabotropic glutamate receptor 5 (mGluR5) found by ligand-based virtual screening.

Authors: Steffen Renner; Tobias Noeske; Christopher G Parsons; Petra Schneider; Tanja Weil; Gisbert Schneider
Journal: Chembiochem Date: 2005-04 Impact factor: 3.164

6. Scaffold-hopping potential of ligand-based similarity concepts.

Authors: Steffen Renner; Gisbert Schneider
Journal: ChemMedChem Date: 2006-02 Impact factor: 3.466

7. Sparse Topological Pharmacophore Graphs for Interpretable Scaffold Hopping.

Authors: Hiroshi Nakano; Tomoyuki Miyao; Jasial Swarit; Kimito Funatsu
Journal: J Chem Inf Model Date: 2021-07-15 Impact factor: 4.956

8. Structure of thrombin complexed with selective non-electrophilic inhibitors having cyclohexyl moieties at P1.

Authors: R Krishnan; I Mochalkin; R Arni; A Tulinsky
Journal: Acta Crystallogr D Biol Crystallogr Date: 2000-03

9. Kinetic and crystallographic studies of thrombin with Ac-(D)Phe-Pro-boroArg-OH and its lysine, amidine, homolysine, and ornithine analogs.

Authors: P C Weber; S L Lee; F A Lewandowski; M C Schadt; C W Chang; C A Kettner
Journal: Biochemistry Date: 1995-03-21 Impact factor: 3.162

10. Structure based pharmacophore modeling, virtual screening, molecular docking and ADMET approaches for identification of natural anti-cancer agents targeting XIAP protein.

Authors: Firoz A Dain Md Opo; Mohammed M Rahman; Foysal Ahammad; Istiak Ahmed; Mohiuddin Ahmed Bhuiyan; Abdullah M Asiri
Journal: Sci Rep Date: 2021-02-18 Impact factor: 4.379