Literature DB >> 32548491

Finding a Maximum Common Subgraph from Molecular Structural Formulas through the Maximum Clique Approach Combined with the Ising Model.

Abstract

We examined the maximum common subgraph (MCS) of four neuraminidase inhibitors that were antiviral medication for treating and preventing type A and B influenza viruses. The MCS was obtained by finding a maximum clique of an association graph constructed from the two input chemical structural formulas. Maximum clique problem was reformulated to Ising Hamiltonian to allow for applying various techniques for optimization. We observed that the combined label for a vertex composed of elemental species and chemical bonds significantly worked well for decreasing the number of vertices in the association graph, which in turn helped to reduce the computational cost.

Entities: Chemical Disease Gene Species

Year: 2020 PMID： 32548491 PMCID： PMC7288595 DOI： 10.1021/acsomega.0c00987

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

A chemical structural formula is often regarded as a kind of graph where atoms and chemical bonds correspond to vertices (V) and edges (E) of the graph G = (V, E), respectively. It is for this reason that approaches based on graph theory have been often used to deal with problems related with the structural formula.[1−3] For example, finding a maximum common subgraph (MCS) from multiple structural formulas is an essential work for drug discovery because the molecules that have similar partial structures are expected to have similar drug efficacy.[4] In graph theory, two undirected graphs, G1 = (V1, E1) and G2 = (V2, E2), are isomorphic if there is a bijection between their vertex sets that preserves adjacency. An MCS stands for a graph that has the largest number of vertices of any graph isomorphic to the induced subgraph of G1 and G2. The MCS has an extremely high computational complexity where several well-known NP-complete problems reduced to it. In the worst case, the number of MCSs increases exponentially as the number of vertices in the graph increases. In addition to applying a structural formula,[5] MCS is a familiar and challenging problem in graph theory because it offers a plethora of applications such as bioinformatics[6] and pattern recognition.[7,8] There are two types of approaches to solve the MCS problem: one is clique-based approaches, and the other is nonclique-based backtracking approaches.[9] The former approaches are the most widely used approaches in the literature, where the MCS problem results in finding a maximum clique in the association graph. An association graph is a particular form of a product graph generated from two input graphs G1 and G2. The definition of the association graph will be given in the Methods section. In graph theory, the maximum size of the adjacency matrix of the association graph is |V1||V2|. The value becomes large unless the size of each graph of G1 and G2 is sufficiently small. To reduce the computational requirements in solving the MCS problem, it appears to be indispensable to exploit attribute information to reduce the size of the adjacency matrix of the association graph. In the molecular structural formula, each vertex, namely, an atom is labeled by a chemical element. In addition, each edge corresponds to a chemical bond, which might also be used as another attribute label. Although a maximum clique in an arbitrary graph can be found using the branch-and-bound algorithm combined with approximate coloring to obtain an upper bound on the size of a maximum clique,[10] it also can be found through the Ising model.[11] It is noteworthy that quantum[12] and classical[13−15] annealing techniques based on the Ising model have attracted great attention over several years as a promising tool for computationally intensive optimization problems. Although it remains to be clarified whether such techniques are more efficient and accurate than conventional metaheuristics, such as simulated annealing and genetic algorithm, or not, the Ising model approach appears to be promising because we can expect further advancement in both software and hardware associated with the annealing-related techniques. In this study, we examined the MCS of four neuraminidase inhibitors. A neuraminidase inhibitor obstructs the action of viral neuraminidases of the influenza virus by preventing its proliferation from the host cell. The maximum cliques of the association graph generated from the two input chemical structural formulas were determined by the tabu search combined with the Ising model. In addition, we examined the effect of the attribute label for a vertex by combining both information of the chemical element and the chemical bond in the construction of the association graph.

Results and Discussion

Simple Example of the Association Graph

We first explain a simple case of an association graph generated from two input graphs G1 = (V1, E1) and G2 = (V2, E2) without attribute information (top panel in Figure ). According to the definition of the association graph (see Methods section), two vertices (A, α) and (B, β) in the association graph are adjacent because (A, B) ϵ E1 and (α, β) ϵ E2. Similarly, two vertices (A, α) and (D, δ) in the association graph are adjacent because (A, D) ∉ E1 and (α, δ) ∉ E2. On the other hand, two vertices (C, γ) and (D, δ) in the association graph are not adjacent because (C, D) ϵ E1 but (γ, δ) ∉ E2. There are four maximum cliques in the association graph: {(A, α), (B, β), (C, δ)}, {(A, α), (B, β), (D, δ)}, {(A, δ), (B, β), (C, α)}, and {(A, δ), (B, β), (D, α)}. Subsequently, we consider the case where the vertices have some kind of attribute information, as shown in the bottom panel in Figure , where the vertices were distinguished by the colors: yellow and blue. There is one maximum clique {(A, α), (B, β), (D, δ)} in the association graph because vertices with different colors do not match. It is noteworthy that the number of vertices in the association graph is decreased from 16 to 8 by considering the attribution information in this example.

Figure 1

Two input graphs for generating an association graph. The case where vertices do not have attribute information (top panel). The case where vertices have attribute information distinguished by colors (bottom panel).

MCS of Zanamivir and Laninamivir Octanoate

Zanamivir (C12H20N4O7) is the first neuraminidase inhibitor commercially developed.[16] It is a medication used to treat influenza caused by type A and B influenza viruses, but it does not have efficacy for the type C virus which does not have neuraminidase. Laninamivir octanoate (C21H36N4O8) is also a neuraminidase inhibitor.[17] The molecule is a prodrug of an octanoic acid ester and undergoes hydrolysis to form an active metabolite of laninamivir (C13H22N4O7) by the body. In MCS problem concerning a chemical structural formula, hydrogen atoms are often ignored to emphasize the structural similarities of the basic skeleton. It is noteworthy that the basic skeleton of zanamivir obtained by ignoring hydrogen atoms is contained in that of laninamivir octanoate, as shown in Figure .

Figure 2

Structural formula of zanamivir (right) and laninamivir octanoate (left). The MCS is represented by red. Although some hydrogen atoms are common between two molecules, they remain black because they were ignored in the scheme of MCS extraction. Although identifying the MCS between zanamivir and laninamivir octanoate was rather a trivial task for this reason, computation through the Ising model correctly assigned the entire basic skeleton of zanamivir as the MCS of the two molecules. In this case, one connected graph was obtained as the MCS by finding a maximum clique of the association graph generated by the two input graphs. Note that, however, the present approach provides the MCS that has the maximum number of constituent vertices, which does not necessarily mean that the obtained MCS is connected. The number of vertices in the association graph was reduced from 759 to 324 and to 118 by considering the label of elemental species only and the combined label of both elemental species and bond information, respectively. It is noteworthy that the value of 78.6% of the off diagonal elements in the reduced adjacency matrix obtained using the combined labels becomes 1. This indicates that most vertices are connected by edges. Although the structural similarity between zanamivir and laninamivir is notable, zanamivir and laninamivir have different dose regimens. The former requires twice a day for five days as the duration of drug exposure. By contrast, the latter needs just one inhalation administration. Octanoic acid in laninamivir octanoate is an 8-carbon straight-chain fatty acid that is almost insoluble in water. Therefore, when it is inhaled, it is not taken up into the blood but passes through the lipid bilayer of mucosal cells and is slowly broken down in the cytoplasm, enabling it to act for a long time. That makes the difference in the dose regimen between laninamivir and zanamivir.

MCS of Zanamivir and Oseltamivir

Subsequently, we examined the MCS of zanamivir and oseltamivir (C16H28N2O4). Oseltamivir is also a neuraminidase inhibitor, and it is used internally unlike inhalation drug such as zanamivir and laninamivir.[18] We can observe the structural similarity between the two molecules even from plain visual comparison of Figure : both molecules contain a six-membered ring and have the same groups of atoms such as −COO and −NHCOCH3. Figure also shows an extracted MCS through a maximum clique of the association graph generated by the two input graphs. Unlike the connected graph in the above-mentioned case, the MCS consists of two disconnected graphs designated by red and sky blue parts. Disconnected graphs are usually obtained by the present approach. In addition, as shown in the figure, we observed that there were four other cases with respect to the possible matching of vertices in the small subgraph comprising four atoms OCCC: (a,c), (a,d), (b,c), (b,d).

Figure 3

Structural formula of zanamivir (right) and oseltamivir (left). The MCS consists of large and small subgraphs represented by red and sky blue parts (top panel). There are other cases (a–d) with respect to the possible matching of vertices for the small subgraph. The number of vertices in the association graph was reduced from 506 to 228 and to 86 by considering the label of elemental species only and the combined label of both elemental species and bond information, respectively. The value of 75.8% of the off diagonal elements in the reduced adjacency matrix obtained by the combined labels becomes 1. It is noteworthy that although the carbon atom designated by a green arrow in Figure should be included in the MCS at first glance, the atom does not belong to the MCS because of the constraint that originates from the definition of the isomorphism of the association graph. The O atom designated by the blue arrow is adjacent to the carbon atom in the left graph. The atom, however, is not adjacent to the carbon atom in the right graph. Consequently, two vertices (Cleft, Cright) and (Oleft, Oright) are not adjacent in the association graph, which indicates that the carbon atom cannot be included in the clique because all vertices in the clique must be adjacent. This result illustrates that the present approach does not necessarily provide the largest connected MCS.

MCS of Zanamivir and Peramivir

Because peramivir (C15H28N4O4) has three inhibition sites for neuraminidase, it shows a high breeding depression effect toward influenza A and B viruses, and it is effective in a single dose through intravenous drip injection.[19] As shown in Figure , we observe that the structural similarity between the two molecules is not so much high as the above-mentioned cases at first glance: the size of the ring is different between the two molecules; nonetheless, they have in common that the same groups of atoms are attached to each ring. The MCS consists of 17 vertices, which indicates that the size of it is larger than that in Section . Figure shows an example of the computed MCS through a maximum clique of the association graph generated by the two input graphs. We observed that the MCS consist of four subgraphs distinguished by colors: red, blue, green, and purple. The smallest subgraph contains just one O atom. Note that there are many MCSs that have the same size. The number of vertices in the association graph was reduced from 529 to 224 and to 84 by considering the label of elemental species only and the combined label of both elemental species and bond information, respectively. The value of 77.5% of the off diagonal elements in the reduced adjacency matrix obtained by using the combined label becomes 1. It is noteworthy that although the carbon atom designated by arrow in Figure should be included in the MCS at first glance, the atom is not included in the MCS because of the definition of isomorphism of the association graph, as explained in Section . We used 0.9 for the value of hyperparameter C to preserve the constraint conditions in this case.

Figure 4

Structural formula of zanamivir (right) and peramivir (left). The MCS consists of four subgraphs represented by red, sky blue, green, and purple parts.

Structural formula of zanamivir (right) and peramivir (left). The MCS consists of four subgraphs represented by red, sky blue, green, and purple parts. Because combinatorial optimization based on the Ising model is a heuristic approach, the optimality of the results is not guaranteed; however, it is obvious in the case of laninamivir octanoate and zanamivir because all heavier atoms in zanamivir are included in the obtained MCS. For the cases of oseltamivir and zanamivir as well as peramivir and zanamivir, we confirmed that the number of vertices of the obtained MCSs agrees with the number of vertices of the maximum clique of the association graphs computed using an API (find_cliques) of NetworkX,[20] a tool for complex network research. In addition, MCSs of other three pairs of molecules (laninamivir octanoate, oseltamivir), (laninamivir octanoate, peramivir), and (oseltamivir, peramivir) were also examined to make the study more exhaustive, and the results were shown as Figure S1 in the Supporting Information.

Conclusions

We examined the MCSs among zanamivir, laninamivir octanoate, oseltamivir, and peramivir. The MCS was determined by finding the maximum clique of the association graph that were generated from two input chemical structural formulas combined with the Ising model. The combined label determined from elemental species and chemical bond significantly worked well for decreasing the number of vertices in the association graph, which in turn, reduced the computational cost. The following point shall be remarked: the present approach finds the MCS that is not necessary a connected graph but rather a graph with the maximum number of vertices. This might cause somewhat of a problem because most tools for the MCS in cheminformatics such as the RDkit[21] try to find the maximum connected graph at least in the default setting. Note that the connected part of the graph in the MCS determined by the present approach does not necessarily correspond to the maximum connected graph of the association graph. Thus, it would be desirable to apply the present approach to the problems that place a premium in deciding the MCS comprising the disconnected graphs. It is also noteworthy that the MCS among three or more molecules can be found by the straightforward extension of the isomorphic mapping for generating the association graph. However, the approach results in significant increase of the computational cost: the number of vertices in the extended association graph of m molecules becomes as many as in the worst case. Consequently, it would be more practical to extract the MCS of many molecules on a one-by-one basis instead of considering all molecules at a time by extending the association graph.

Computational Methods

Association Graph

Given two undirected graphs G1 = (V1, E1) and G2 = (V2, E2), the association graph G = (V, E) is an undirected graph defined on the vertex set V = V1 × V2 with two vertices (u1, v1) and (u2, v2) being adjacent whenever

Labeling of Atoms

There are two types of labels with respect to structural formula: elemental species and chemical bonds corresponding to vertices and edges, respectively. It is noteworthy that the label for the edge can be combined into the label for the vertex if we define the vertex label as elemental species having specific bond information. This significantly reduces the number of possible matching of vertices in the construction of the association graph. We defined the combined label as follows: tens place and ones place digits correspond to elemental species and bond information labels, respectively. In fact, the combined label of atom X (L[X]) was assigned by L[X] = (atomic number of X) × 10 + (bond order of X). For example, in the case of C≡O, L[C] = 6 × 10 + 3 (=63) and L[O] = 8 × 10 + 3 (=83). To further reduce the actual number of vertices and to focus on a basic skeleton of a molecule formed by elements heavier than H, we ignored H from the structural formula. We assumed that the atoms in and out of the ring did not match in the association graph and set the atoms constituting the ring to have the bond order of 4 for the convenience of the identification. In other words, we did not distinguish double bonds from single bonds in the ring. Let us consider a hydrocarbon (CH2) consisting of n carbon atoms with m C=C double bonds. (However, we ignored =C= to simplify the analysis.) It is noteworthy that the molecule has (n – 2m) >C< and 2m >C=. If we construct an association graph from two CH2 molecules, the number of vertices in the graph will be n2 for the vertex label that considers only elements, but (n – 2m)2 + 4m2 for the vertex label consolidates elements and bonds. The ratio of the latter to the former is significantly smaller than 1 over a wide range of n for a certain magnitude of m, as shown in Figure . Because quadratic unconstrained binary optimization (QUBO) is on the scale of the square of the number of vertices and because the increase of the number of vertices requires more optimization iterations, the computational complexity can be considered to be substantially proportional to the third power of the number of vertices or more, the reduction of the number of vertices is very important for improving the efficiency of the computation.

Figure 5

Ratio of the number of vertices of the association graph from two CH2 molecules. Numerator corresponds to the number from the vertex label consolidating elements and bonds while the denominator corresponds to the number considering only elements.

Solving Maximum Clique Problem through Ising Model

A clique is an induced subgraph of G that forms a complete graph: every two distinct vertices in the clique are adjacent. A maximum clique of a graph G is a clique such that there is no clique with more vertices. The NP-complete decision problem of whether or not a clique of the size K exists can be written as an Ising-like model using a binary bit variable x (0 or 1).[11] The relevant Hamiltonian (H) for the maximum clique problem is the sum of three partial Hamiltonian operatorswhereand Here, N is the number of vertices of graph G, and y is an ancillary binary bit to represent a clique whose size is n. It is noteworthy that HA becomes zero only when the number of selected vertices x is n. Similarly, HB becomes zero only when the selected vertices x form a complete graph. HC means that it is energetically favorable to include as many vertices as possible into the clique, although the aggressive inclusion may be penalized by the terms HA and HB. The constructed Ising Hamiltonian of the clique problem is a quadratic expression of the binary variables (xv and y), which is consistent with the QUBO form to which various optimization techniques are applicable. Here, we used Qbsolv (tabu search) that is an application programing interface supported in D-Wave Ocean (software development kit developed by D-Wave Systems). We set the hyperparameters A, B, and C in the Hamiltonian to be A = (Δ+ 2) B and B = C as suggested by Lucas.[11] In particular, we set Δ = min (|V1|, |V2|) and C = 1 unless otherwise stated. In actual computation, we used the following binary approach to reduce the number of ancillary bits from N to M + 1 as explained by Lucas

9 in total

Review 1. Oseltamivir: a review of its use in influenza.

Authors: K McClellan; C M Perry
Journal: Drugs Date: 2001 Impact factor: 9.546

2. Protein function prediction via graph kernels.

Authors: Karsten M Borgwardt; Cheng Soon Ong; Stefan Schönauer; S V N Vishwanathan; Alex J Smola; Hans-Peter Kriegel
Journal: Bioinformatics Date: 2005-06 Impact factor: 6.937

3. Anti-influenza virus activity of peramivir in mice with single intramuscular injection.

Authors: Shanta Bantia; C Shane Arnold; Cynthia D Parker; Ramanda Upshaw; Pooran Chand
Journal: Antiviral Res Date: 2005-11-21 Impact factor: 5.970

4. Quantum annealing with manufactured spins.

Authors: M W Johnson; M H S Amin; S Gildert; T Lanting; F Hamze; N Dickson; R Harris; A J Berkley; J Johansson; P Bunyk; E M Chapple; C Enderud; J P Hilton; K Karimi; E Ladizinsky; N Ladizinsky; T Oh; I Perminov; C Rich; M C Thom; E Tolkacheva; C J S Truncik; S Uchaikin; J Wang; B Wilson; G Rose
Journal: Nature Date: 2011-05-12 Impact factor: 49.962

1 in total

1. Maximizing gerrymandering through ising model optimization.

Authors: Yasuharu Okamoto
Journal: Sci Rep Date: 2021-12-08 Impact factor: 4.379