Literature DB >> 20037730

A similarity search using molecular topological graphs.

Abstract

A molecular similarity measure has been developed using molecular topological graphs and atomic partial charges. Two kinds of topological graphs were used. One is the ordinary adjacency matrix and the other is a matrix which represents the minimum path length between two atoms of the molecule. The ordinary adjacency matrix is suitable to compare the local structures of molecules such as functional groups, and the other matrix is suitable to compare the global structures of molecules. The combination of these two matrices gave a similarity measure. This method was applied to in silico drug screening, and the results showed that it was effective as a similarity measure.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2009 PMID： 20037730 PMCID： PMC2796334 DOI： 10.1155/2009/231780

Source DB: PubMed Journal: J Biomed Biotechnol ISSN： 1110-7243

1. Introduction

The molecular similarity measure is an important tool in chemometrics and chemoinformatics. The main applications include ligand-based in silico (virtual) drug screening, ADME-Tox (adsorption, distribution, metabolism, excretion, and toxicity) property prediction, physical molecular property prediction (1-octanol-water partition coefficient, solubility), and measurement of the diversity of chemical compounds in a library. The molecular similarity measure generally assigns a one-dimensional (1D) and/or two-dimensional (2D) descriptor—that is, molecular fingerprints based on substructure, molecular mass, number of rotatable bonds, number of hydrogen donors/acceptors of the compound, and so forth—to compounds so that the similarities of the compounds can be evaluated [1-5]. Many methods have been proposed for the similarity search of chemical compounds, such as the comparison of overlapping substructures in the form of Daylight fingerprints (Daylight Chemical Information Systems Inc., Aliso Viejo, CA, USA), the chemically advanced template search (CATS) descriptor method developed by Pickett [6], and the Burden-CAS-University of Texas (BCUT) descriptor method [7]. One of the most widely used methods is to compare the existence of fragment structures; this is the technique employed by the MACSS key, which was developed by Molecular Design Limited (MDL, Santa Clara, CA, USA). Each element of the feature vector of the molecule represents the existence of a particular fragment structure in the molecule (dictionary based fingerprinting). A rather large example of a dictionary used for this fingerprinting technique is the software program DRAGON developed by Talete SRL (geographical information), which consists of more than 3200 molecular descriptors. The affinity fingerprint approach is a new type of similarity search method based on a multiprotein/multicompound affinity matrix [8-21]. In this method, each element of the feature vector of the molecule represents the binding affinity of the molecule with a particular protein. Usually, the binding affinity is measured by calculation using a protein-compound docking program. There are various applications for molecular similarity, and thus many types of similarity measures are needed. Most of the conventional molecular descriptors aim scaffold hopping (lead hopping) to find a compound with a different scaffold from the known active compound. However, in some cases, we want to find similar compounds with similar scaffolds. For example, in lead generation, we want to find a series of similar compounds with the same or similar scaffold instead of performing an actual synthesis. Substructure searches have been used for this purpose [22, 23]. However, a comparison of the indices of molecular topologies is much faster than a substructure search. A topological index, which is any of several numerical parameters of a molecular graph, is also widely used [5]. The Wiener index, Hosoya index, and Randic's molecular connectivity index, are graph invariants and conventional topological indices. These topological indices show correlation to the physical or chemical properties of molecules, although these indices do not recognize atom types and they can be quite difficult to calculate. In the current study, we proposed a new similarity measure for identifying topologically similar compounds based on their molecular topologies and evaluated this method by applying it to a ligand-based drug screening test.

2. Methods

2.1. Similarity and Distance between Compounds

First, the all-atom model compound structures are converted to united atom models, in which all hydrogen atoms are omitted and the atomic charge of the hydrogen atom is added to the atomic charge of the connected heavy atom. In this method, the adjacency matrix E and the distance matrix D are used, and Figure 1 shows an example of these two matrices for a simple graph. The topology of the compound can be represented by an edge-adjacency matrix E [4, 5]:

Figure 1

Example of the matrices E and D.

where e is the a-b element of matrix E. The value of e could be the bond order between the ath and the bth atoms (the value of e could be 1.5 for an aromatic bond). Just as in the BCUT method, the diagonal part (e) is replaced by the converted atomic charge q: where q is an atomic partial charge and c is a coefficient. In this study, c was set to 1.0. The q value is >0 for any q value. The a-b matrix element of the pseudodistance matrix D represents the minimum path length between the ath and bth atoms: where d is a shortest path length between the ath and the bth atoms. We also tried D = d (shortest-path topological distance matrix [4, 5]) and D = d1/2 and found that D = ln (1 + d) gave the best result among these definitions. Let ε and N be the ith eigenvalue of the matrix E or D and the number of atoms of the united-atom model of the compound. We define the eigenvalue histogram g(ε) as follows: Here ε and c are the energy and the arbitral coefficient. The distance S(A, B) between molecules A and B is defined based on the eigenvalue histogram of molecule A(g(ε)) and that of molecule B(g(ε)) as follows: In the current in silico drug screening, candidate hit compounds are selected using the following method. Let S(A, B) and S(A, B) be the distance between A and B based on the adjacency matrix E and that based on the distance matrix D. These two distances give the consensus distance S′(A, B) with the weight parameter λ: Compounds that are close to the known active compounds are selected as the candidate hit compounds. For this purpose, the distance to the known active compounds is introduced. The distance from the kth compound to the average position of the active compounds (Dist) is defined as where A, C, and M are the kth compound, ith active compounds, and the total number of the active compounds. When the number of active compounds is one, Dist = S′. We call Dist the molecular-graph (MG) distance and we call this screening procedure the molecular-graph distance (MGD) method. The eigenvalues (εi) of S and S of each compound of the compound library are stored in a database file a priori. For a query compound, the eigenvalues of S and S must be calculated, which costs less than 1 second. The database search is conducted only to perform the calculations in (4)–(7) and thus is quite fast.

3. Preparation of Materials

For the drug screening test, our target proteins were the macrophage migration inhibitory factor (MIF), cyclooxygenase-2 (COX-2), human immunodeficiency virus protease-1 (HIV), thermolysin (THR), glutathione S-transferase (GST), the histamine H1 receptor, the adrenaline beta receptor, the serotonin receptor, and the dopamine D2 receptor. For validation of the present method, we used the same set of compounds as used in our previous study [20]. Namely, the compound set consisted of 12 inhibitors of MIF, 28 inhibitors of THR, 15 inhibitors of COX-2, 20 inhibitors of HIV, 12 inhibitors of GST, 10 antagonists of the histamine H1 receptor [24], 12 agonists and 13 antagonists of the adrenaline beta receptor [25], 8 agonists and 9 antagonists of the serotonin receptor [26], and 6 agonists and 15 antagonists of the dopamine D2 receptor [27] as the active compounds, along with 11050 potentially negative compounds from the random compound library of the Coelacanth Chemical Corporation (East Windsor, NJ, USA). Typically, only one hit compound could be found out of 104 randomly selected compounds; we therefore expected that there would be no more than a few, if any, hit compounds among these 11212 compounds. The 160 active compounds are listed in the Supplemental Materials available online at doi:10.1155/2009/231780. The size distribution of compounds was as follows: percentage of compounds with 0 ~ 19 atoms, 0.1%; with 20 ~ 29 atoms, 1.2%; with 30 ~ 39 atoms, 1.6%; with 40 ~ 49 atoms, 9.3%; with 50 ~ 59 atoms, 22.5%; with 60 ~ 69 atoms, 37.9%; with 70 ~ 79 atoms, 20.5%; and with more than 80 atoms, 7.0%. The average number of heavy atoms was 30.9. The atomic charge of each ligand was determined by the Gasteiger method [28, 29]. To calculate the Gasteiger charge, an all-atom model is necessary for each compound. The three-dimensional (3D) coordinates of the 11050 random compounds above were generated to add the hydrogen atoms by the Concord program (Tripos, St. Louis, MO, USA) from 2D Sybyl SD files provided by the Coelacanth Chemical Corporation. The 3D coordinates of the active compounds (inhibitors, substrates, agonists, and antagonists) were generated by Chem3D (Cambridge Software, Cambridge, MA, USA).

4. Results

To evaluate the efficiency of this method, the leave-one-out cross-validation test was applied; namely, the active compounds of each target protein were selected one by one as the known active compounds for this software and the other unknown active compounds were discovered by the software. The test dataset consists of these active compounds and the other approximately 104 potential inactive compounds (decoy set). The total number of compounds was 11212. A total of 160 (= total 160 active compounds) database enrichment curves were calculated for these 9 target proteins and 11212 compounds and the results were averaged. The surface area (q or “area under curve”: AUC) under the total database enrichment curve (f) is a measure of the database enrichment: where x and f(x) are the percentages of compounds that are selected from the total compound library and the database enrichment curve, respectively. A higher q value corresponds to better database enrichment, and the q value is always greater than zero and less than 100. For the random screening, q = 50. First, the λ dependence of the hit ratio was examined in the MGD method. The average q values and the hit ratio of the 160 trials with various λ values are summarized in Table 1. The coefficients c for matrices E and D were optimized for every λ to maximize the hit ratio. When λ = 0 or λ = 1, the hit ratio and the q values were lower than those in the other cases. This result showed that the combination of matrices D and E is more effective than the single usage of either D or E. The optimized coefficients were used in the following study. The average eigenvalues of D and E were −3.21 ∗ 10−6 and 0.505 for the decoy set, respectively. The histograms g(ε) of (4) were close to single Gaussian distributions. We show the distributions of g(ε) of H1 antagonist diphenhydramine and COX-2 inhibitor indomethacin in the supplementary data. For diphenhydramine, the average g(ε) values of S and S were 1.46 ∗ 10−3 and 1.009 ∗ 10−3, respectively. The deviations of g(ε) values of S and S were 75.75 and 9.712, respectively. For indomethacin, the average g(ε) values of S and S were 1.46 ∗ 10−3 and 1.009 ∗ 10−3, respectively. The deviations of g(ε) values of S and S were 75.88 and 9.739, respectively.

Table 1

The average q values and the hit ratio with optimized coefficients for various λ values. The coefficient1 is the c parameter in (4) for the E matrix and the coefficient2 is the c parameter in (4) for the D matrix. The hit ratio is the % of the active compounds found within the first 1% of selected compounds of the database. The average and σ are the average value and the standard deviation of the scores. ∗: the top-ranked compounds when diphenhydramine is the template.

λ	0	0.25	0.5	0.75	1
Hit ratio	28.86%	36.48%	35.76%	35.69%	26.93%
q value	80.14	82.53	80.37	82.09	72.72
Coefficient1	0.002	0.01	0.02	0.01	0
Coeffcient2	0	0.00005	0.00005	0.00001	0.0001
Average	0.100 ∗ 10⁻⁴	3.61 ∗ 10⁻⁴	7.13 ∗ 10⁻⁴	10.64 ∗ 10⁻⁴	14.163 ∗ 10⁻⁴
σ	7.50 ∗ 10⁻⁶	1.62 ∗ 10⁻⁴	3.24 ∗ 10⁻⁴	4.87 ∗ 10⁻⁴	6.50 ∗ 10⁻⁴
1st compound*
z-score	1.321	2.218	2.189	2.179	2.174
Score	0.00008295	0.001723	0.001944	0.002164	0.002383
2nd compound*
z-score	1.315	2.202	2.179	2.171	2.167
Score	0.0001266	0.004425	0.005333	0.006242	0.00715
3rd compound*
z-score	1.314	2.192	2.169	2.161	2.157
Score	0.0001338	0.005944	0.008402	0.01085	0.01331

Second, the score distribution was examined by the MGD method. The average values and the standard deviations for the λ values are summarized in Table 1. Figure 2 shows a score distribution with λ = 0.25 using diphenhydramine as the template (see Figure 4). The template corresponds to Dist = 0. The frequency was normalized; the surface area under the curve was set to 1. The distribution is similar to Gaussian distribution.

Figure 2

Distribution of the values of Dist. The template is diphenhydramine. The Dist values are multiplied by 10000 times and the frequency is normalized.

Figure 4

Histamine H1 receptor antagonists.

Figure 3 shows the average database enrichment results of the 160 trial screening tests by the leave-one-out cross-validation test. The MGD method worked well and showed good enrichment. In this calculation, λ was set to 0.25. Within the first 1%, 5%, and 10% of the database, 36.5%, 52.8%, and 60.0% of the active compounds were found by the similarity measure defined by (7), respectively. The average q value by the MGD method using (7) was 82.53. This result is worse than the result by the machine-learning docking score index method reported previously. Namely, about 70% of the active compounds were found within the first 1% of the database and the average q value was 98.5.

Figure 3

Average database enrichment curves of 160 active compounds with the current similarity measure. Open circles represent the average database enrichments with the distance measure defined by (6) (MGD) with λ = 0.25. The dashed line represents the random screening result.

Ten histamine H1 receptor antagonists were included in the compound library (see Figure 4). Figure 5 shows the known active compound (template) and the best ranked molecules, when λ = 0, 0.25 and 1. The Dist values and z-scores (= (score – average score)/standard deviation) of the three top-ranked compounds are summarized in Table 1. These z-scores show that the score distribution is slightly different from the Gaussian distribution. In the Gaussian distribution, the number of compounds with a z-score >3 is 0.1% of the database (10 compounds in this case). The z-scores of the top-ranked compounds were only 2, in this case. The selected molecules look similar to the template compound, diphenhydramine. For λ = 0.25, the other H1 receptor antagonist, chlorpheniramine, was found as the third compound. Both diphenhydramine and chlorpheniramine have a diphenyl group. The other compounds (a1 and a2) are not very similar to the template. For λ = 0, compound b2, which is not an H1 antagonist, is similar to the template. Compound b3 is identical to compound a1. For λ = 1, the three best-ranked compounds are not particularly similar to the query. Compounds a1, a2, b3, and c3 are similar to each other.

Figure 5

The three top-ranked compounds by the MGD method for the template: diphenhydramine. (a) The result with λ = 0.25. (b) The result with λ = 0. (c) The result with λ = 1.

5. Discussion

Figure 3 shows that 36.5% of the active compounds were found within the first 1% of the compounds of the whole library. Our previous study showed that 12.4%, 43.4%, and 67.5% of the active compounds were found within the first 1% of the database by the docking score index (DSI), factor-selection DSI (FS-DSI), and machine-learning DSI (ML-DSI) methods, respectively, when 180 proteins were used to calculate the affinity fingerprint [19]. The three-dimensional (3D) shape and charge distribution of a compound govern the protein-compound binding energy. The DSI, FS-DSI, and ML-DSI methods utilize the 3D shape and charge distribution of the compound through the affinity fingerprint. On the other hand, the 2D structure of the compound does not govern the protein-compound binding energy. Thus, the current similarity measure was not better than the previously developed screening methods for in silico drug screening, when it was used as a single measure to describe the molecular similarity. However, the MGD method did have an advantage in terms of its computational speed. The MGD method can search 10 000 000 compounds within 1 hour on a Xeon 3 GHz computer, which is 1000 times faster than the MSM-DSI method. However, the current similarity measure was still effective for in silico drug screening. Our active compounds were chosen based on literature. As shown in Figure 4, some compounds were very similar to each other by human-eye inspection. The main reason for this similarity was likely that these compounds were generated from a progenitor compound by lead optimization. Diphenhydramine, chlorpheniramine, homochlorcyclizine, cetirizine, and clemastine have a diphenyl-like group. In promethazine, olopatadine, mequitazine, and cyprohrptadine, the conformations of two phenyl groups are fixed. Most of these antagonists are structurally similar, which should be the reason why the current similarity measure was effective for the in silico drug screening. In other words, this method is not suitable for scaffold hopping (lead hopping) [30]. For scaffold hopping, other methods have been developed [30, 31]. If all matrix elements of the off-diagonal part are zero, the eigenvalues are equal to the values of the diagonal part. The off-diagonal part shifts the eigenvalues from the values of the diagonal part. In (1), the bond information close to the ith atom can give major perturbation on the ith diagonal value (atomic charge of the ith atom). Thus the matrix E represents the short-range information of the molecular topology. On the other hand, in (3), the i-j matrix element of the matrix D becomes large when the ith atom is far from the jth atom on the molecular topology. Thus the matrix D can represent the long-range information of the molecular topology. The information of the matrix E and that of D are independent of each other. For each compound in the test database of 104 compounds, the values of S and S in (6) were calculated. The correlation coefficient of the values of S and S was only 0.08, indicating that there was no correlation between these values. Thus, the compounds in the compound library are widely distributed in the (S, S) two-dimensional (2D) space, and the MGD method selects the compounds around the query compound from the compound library in the 2D space.

6. Conclusion

We developed a similarity measure for chemical compounds that is based on the molecular topology, the atomic charge, and the minimum path length between atoms. The histograms of eigenvalues of these matrices were smoothed to generate a continuous curve. The difference and overlap between these two histograms define the distance between the two compounds. This similarity measure was applied to ligand-based in silico drug screening. In this calculation, compounds whose molecular topology structures are similar to the given active compounds were selected by using this similarity measure. This measure actually worked to find unknown active compounds from a random compound library. The 160 active compounds are listed in the Supplemental Materials (compound names, SMILES and figures of g(e) of two compounds). Click here for additional data file.

17 in total

1. Flexsim-X: a method for the detection of molecules with similar biological activity

Authors:
Journal: J Chem Inf Comput Sci Date: 2000-03

2. Flexsim-R: a virtual affinity fingerprint descriptor to calculate similarities of functional groups.

Authors: Alexander Weber; Andreas Teckentrup; Hans Briem
Journal: J Comput Aided Mol Des Date: 2002-12 Impact factor: 3.686

3. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways.

Authors: Masahiro Hattori; Yasushi Okuno; Susumu Goto; Minoru Kanehisa
Journal: J Am Chem Soc Date: 2003-10-01 Impact factor: 15.419

4. Multiple target screening method for robust and accurate in silico ligand screening.

Authors: Yoshifumi Fukunishi; Yoshiaki Mikami; Satoru Kubota; Haruki Nakamura
Journal: J Mol Graph Model Date: 2005-12-22 Impact factor: 2.518

5. Finding ligands for G protein-coupled receptors based on the protein-compound affinity matrix.

Authors: Yoshifumi Fukunishi; Satoru Kubota; Haruki Nakamura
Journal: J Mol Graph Model Date: 2006-05-10 Impact factor: 2.518

6. A virtual active compound produced from the negative image of a ligand-binding pocket, and its application to in-silico drug screening.

Authors: Yoshifumi Fukunishi; Satoru Kubota; Chisato Kanai; Haruki Nakamura
Journal: J Comput Aided Mol Des Date: 2006-06-21 Impact factor: 3.686

7. Noise reduction method for molecular interaction energy: application to in silico drug screening and in silico target protein screening.

Authors: Yoshifumi Fukunishi; Satoru Kubota; Haruki Nakamura
Journal: J Chem Inf Model Date: 2006 Sep-Oct Impact factor: 4.956

8. Molecular similarity based on DOCK-generated fingerprints.

Authors: H Briem; I D Kuntz
Journal: J Med Chem Date: 1996-08-16 Impact factor: 7.446

9. Classification of chemical compounds by protein-compound docking for use in designing a focused library.

Authors: Yoshifumi Fukunishi; Yoshiaki Mikami; Kei Takedomi; Masaya Yamanouchi; Hideaki Shima; Haruki Nakamura
Journal: J Med Chem Date: 2006-01-26 Impact factor: 7.446

10. Novel cyclooxygenase-1 inhibitors discovered using affinity fingerprints.

Authors: Nancy Hsu; Danying Cai; Komath Damodaran; Reinaldo F Gomez; James G Keck; Edgardo Laborde; Robert T Lum; Thomas J Macke; Graeme Martin; Steven R Schow; Reyna J Simon; Hugo O Villar; Michael M Wick; Paul Beroza
Journal: J Med Chem Date: 2004-09-23 Impact factor: 7.446

1 in total

1. Semantic similarity for automatic classification of chemical compounds.

Authors: João D Ferreira; Francisco M Couto
Journal: PLoS Comput Biol Date: 2010-09-23 Impact factor: 4.475

1 in total