Yasuharu Okamoto1. 1. The System Platform Research Laboratories, NEC Corporation, 34 Miyukigaoka, Tsukuba, Ibaraki 305-8568, Japan.
Abstract
We examined the maximum common subgraph (MCS) of four neuraminidase inhibitors that were antiviral medication for treating and preventing type A and B influenza viruses. The MCS was obtained by finding a maximum clique of an association graph constructed from the two input chemical structural formulas. Maximum clique problem was reformulated to Ising Hamiltonian to allow for applying various techniques for optimization. We observed that the combined label for a vertex composed of elemental species and chemical bonds significantly worked well for decreasing the number of vertices in the association graph, which in turn helped to reduce the computational cost.
We examined the maximum common subgraph (MCS) of four neuraminidase inhibitors that were antiviral medication for treating and preventing type A and Binfluenza viruses. The MCS was obtained by finding a maximum clique of an association graph constructed from the two input chemical structural formulas. Maximum clique problem was reformulated to Ising Hamiltonian to allow for applying various techniques for optimization. We observed that the combined label for a vertex composed of elemental species and chemical bonds significantly worked well for decreasing the number of vertices in the association graph, which in turn helped to reduce the computational cost.
A chemical structural formula is often regarded as a kind of graph
where atoms and chemical bonds correspond to vertices (V) and edges (E) of the graph G =
(V, E), respectively. It is for
this reason that approaches based on graph theory have been often
used to deal with problems related with the structural formula.[1−3] For example, finding a maximum common subgraph (MCS) from multiple
structural formulas is an essential work for drug discovery because
the molecules that have similar partial structures are expected to
have similar drug efficacy.[4] In graph theory,
two undirected graphs, G1 = (V1, E1) and G2 = (V2, E2), are isomorphic if there is a bijection between their
vertex sets that preserves adjacency. An MCS stands for a graph that
has the largest number of vertices of any graph isomorphic to the
induced subgraph of G1 and G2. The MCS has an extremely high computational complexity
where several well-known NP-complete problems reduced to it. In the
worst case, the number of MCSs increases exponentially as the number
of vertices in the graph increases. In addition to applying a structural
formula,[5] MCS is a familiar and challenging
problem in graph theory because it offers a plethora of applications
such as bioinformatics[6] and pattern recognition.[7,8]There are two types of approaches to solve the MCS problem:
one
is clique-based approaches, and the other is nonclique-based backtracking
approaches.[9] The former approaches are
the most widely used approaches in the literature, where the MCS problem
results in finding a maximum clique in the association graph. An association
graph is a particular form of a product graph generated from two input
graphs G1 and G2. The definition of the association graph will be given in the Methods section. In graph theory, the maximum size
of the adjacency matrix of the association graph is |V1||V2|. The value becomes
large unless the size of each graph of G1 and G2 is sufficiently small. To reduce
the computational requirements in solving the MCS problem, it appears
to be indispensable to exploit attribute information to reduce the
size of the adjacency matrix of the association graph. In the molecular
structural formula, each vertex, namely, an atom is labeled by a chemical
element. In addition, each edge corresponds to a chemical bond, which
might also be used as another attribute label.Although a maximum
clique in an arbitrary graph can be found using
the branch-and-bound algorithm combined with approximate coloring
to obtain an upper bound on the size of a maximum clique,[10] it also can be found through the Ising model.[11] It is noteworthy that quantum[12] and classical[13−15] annealing techniques based on
the Ising model have attracted great attention over several years
as a promising tool for computationally intensive optimization problems.
Although it remains to be clarified whether such techniques are more
efficient and accurate than conventional metaheuristics, such as simulated
annealing and genetic algorithm, or not, the Ising model approach
appears to be promising because we can expect further advancement
in both software and hardware associated with the annealing-related
techniques.In this study, we examined the MCS of four neuraminidase
inhibitors.
A neuraminidase inhibitor obstructs the action of viral neuraminidases
of the influenza virus by preventing its proliferation from the host
cell. The maximum cliques of the association graph generated from
the two input chemical structural formulas were determined by the
tabu search combined with the Ising model. In addition, we examined
the effect of the attribute label for a vertex by combining both information
of the chemical element and the chemical bond in the construction
of the association graph.
Results and Discussion
Simple Example of the Association Graph
We first explain
a simple case of an association graph generated
from two input graphs G1 = (V1, E1) and G2 = (V2, E2) without attribute information (top panel in Figure ). According to the
definition of the association graph (see Methods
section), two vertices (A, α) and (B, β) in the association graph are adjacent because
(A, B) ϵ E1 and (α, β) ϵ E2. Similarly, two vertices (A, α) and
(D, δ) in the association graph are adjacent
because (A, D) ∉ E1 and (α, δ) ∉ E2. On the other hand, two vertices (C, γ) and (D, δ) in the association graph
are not adjacent because (C, D)
ϵ E1 but (γ, δ) ∉ E2. There are four maximum cliques in the association
graph: {(A, α), (B, β),
(C, δ)}, {(A, α), (B, β), (D, δ)}, {(A, δ), (B, β), (C, α)},
and {(A, δ), (B, β),
(D, α)}. Subsequently, we consider the case
where the vertices have some kind of attribute information, as shown
in the bottom panel in Figure , where the vertices were distinguished by the colors: yellow
and blue. There is one maximum clique {(A, α),
(B, β), (D, δ)} in the
association graph because vertices with different colors do not match.
It is noteworthy that the number of vertices in the association graph
is decreased from 16 to 8 by considering the attribution information
in this example.
Figure 1
Two input graphs for generating an association graph.
The case
where vertices do not have attribute information (top panel). The
case where vertices have attribute information distinguished by colors
(bottom panel).
Two input graphs for generating an association graph.
The case
where vertices do not have attribute information (top panel). The
case where vertices have attribute information distinguished by colors
(bottom panel).
MCS of
Zanamivir and Laninamivir Octanoate
Zanamivir (C12H20N4O7) is the first neuraminidase
inhibitor commercially developed.[16] It
is a medication used to treat influenza caused
by type A and Binfluenza viruses, but it does not have efficacy for
the type C virus which does not have neuraminidase. Laninamivir octanoate
(C21H36N4O8) is also a
neuraminidase inhibitor.[17] The molecule
is a prodrug of an octanoic acid ester and undergoes hydrolysis to
form an active metabolite of laninamivir (C13H22N4O7) by the body. In MCS problem concerning
a chemical structural formula, hydrogen atoms are often ignored to
emphasize the structural similarities of the basic skeleton. It is
noteworthy that the basic skeleton of zanamivir obtained by ignoring
hydrogen atoms is contained in that of laninamivir octanoate, as shown
in Figure .
Figure 2
Structural
formula of zanamivir (right) and laninamivir octanoate
(left). The MCS is represented by red. Although some hydrogen atoms
are common between two molecules, they remain black because they were
ignored in the scheme of MCS extraction.
Structural
formula of zanamivir (right) and laninamivir octanoate
(left). The MCS is represented by red. Although some hydrogen atoms
are common between two molecules, they remain black because they were
ignored in the scheme of MCS extraction.Although identifying the MCS between zanamivir and laninamivir
octanoate was rather a trivial task for this reason, computation through
the Ising model correctly assigned the entire basic skeleton of zanamivir
as the MCS of the two molecules. In this case, one connected graph
was obtained as the MCS by finding a maximum clique of the association
graph generated by the two input graphs. Note that, however, the present
approach provides the MCS that has the maximum number of constituent
vertices, which does not necessarily mean that the obtained MCS is
connected. The number of vertices in the association graph was reduced
from 759 to 324 and to 118 by considering the label of elemental species
only and the combined label of both elemental species and bond information,
respectively. It is noteworthy that the value of 78.6% of the off
diagonal elements in the reduced adjacency matrix obtained using the
combined labels becomes 1. This indicates that most vertices are connected
by edges. Although the structural similarity between zanamivir and
laninamivir is notable, zanamivir and laninamivir have different dose
regimens. The former requires twice a day for five days as the duration
of drug exposure. By contrast, the latter needs just one inhalation
administration. Octanoic acid in laninamivir octanoate is an 8-carbon
straight-chain fatty acid that is almost insoluble in water. Therefore,
when it is inhaled, it is not taken up into the blood but passes through
the lipid bilayer of mucosal cells and is slowly broken down in the
cytoplasm, enabling it to act for a long time. That makes the difference
in the dose regimen between laninamivir and zanamivir.
MCS of Zanamivir and Oseltamivir
Subsequently, we examined
the MCS of zanamivir and oseltamivir (C16H28N2O4). Oseltamivir is
also a neuraminidase inhibitor, and it is used internally unlike inhalation
drug such as zanamivir and laninamivir.[18] We can observe the structural similarity between the two molecules
even from plain visual comparison of Figure : both molecules contain a six-membered ring
and have the same groups of atoms such as −COO and −NHCOCH3. Figure also
shows an extracted MCS through a maximum clique of the association
graph generated by the two input graphs. Unlike the connected graph
in the above-mentioned case, the MCS consists of two disconnected
graphs designated by red and sky blue parts. Disconnected graphs are
usually obtained by the present approach. In addition, as shown in
the figure, we observed that there were four other cases with respect
to the possible matching of vertices in the small subgraph comprising
four atoms OCCC: (a,c), (a,d), (b,c), (b,d).
Figure 3
Structural formula of
zanamivir (right) and oseltamivir (left).
The MCS consists of large and small subgraphs represented by red and
sky blue parts (top panel). There are other cases (a–d) with
respect to the possible matching of vertices for the small subgraph.
Structural formula of
zanamivir (right) and oseltamivir (left).
The MCS consists of large and small subgraphs represented by red and
sky blue parts (top panel). There are other cases (a–d) with
respect to the possible matching of vertices for the small subgraph.The number of vertices in the association graph
was reduced from
506 to 228 and to 86 by considering the label of elemental species
only and the combined label of both elemental species and bond information,
respectively. The value of 75.8% of the off diagonal elements in the
reduced adjacency matrix obtained by the combined labels becomes 1.
It is noteworthy that although the carbon atom designated by a green
arrow in Figure should
be included in the MCS at first glance, the atom does not belong to
the MCS because of the constraint that originates from the definition
of the isomorphism of the association graph. The O atom designated
by the blue arrow is adjacent to the carbon atom in the left graph.
The atom, however, is not adjacent to the carbon atom in the right
graph. Consequently, two vertices (Cleft, Cright) and (Oleft, Oright) are not adjacent in the
association graph, which indicates that the carbon atom cannot be
included in the clique because all vertices in the clique must be
adjacent. This result illustrates that the present approach does not
necessarily provide the largest connected MCS.
MCS of
Zanamivir and Peramivir
Because
peramivir (C15H28N4O4)
has three inhibition sites for neuraminidase, it shows a high breeding
depression effect toward influenza A and B viruses, and it is effective
in a single dose through intravenous drip injection.[19] As shown in Figure , we observe that the structural similarity between the two
molecules is not so much high as the above-mentioned cases at first
glance: the size of the ring is different between the two molecules;
nonetheless, they have in common that the same groups of atoms are
attached to each ring. The MCS consists of 17 vertices, which indicates
that the size of it is larger than that in Section . Figure shows an example of the computed MCS through a maximum
clique of the association graph generated by the two input graphs.
We observed that the MCS consist of four subgraphs distinguished by
colors: red, blue, green, and purple. The smallest subgraph contains
just one O atom. Note that there are many MCSs that have the same
size. The number of vertices in the association graph was reduced
from 529 to 224 and to 84 by considering the label of elemental species
only and the combined label of both elemental species and bond information,
respectively. The value of 77.5% of the off diagonal elements in the
reduced adjacency matrix obtained by using the combined label becomes
1. It is noteworthy that although the carbon atom designated by arrow
in Figure should
be included in the MCS at first glance, the atom is not included in
the MCS because of the definition of isomorphism of the association
graph, as explained in Section . We used 0.9 for the value of hyperparameter C to preserve the constraint conditions in this case.
Figure 4
Structural
formula of zanamivir (right) and peramivir (left). The
MCS consists of four subgraphs represented by red, sky blue, green,
and purple parts.
Structural
formula of zanamivir (right) and peramivir (left). The
MCS consists of four subgraphs represented by red, sky blue, green,
and purple parts.Because combinatorial
optimization based on the Ising model is
a heuristic approach, the optimality of the results is not guaranteed;
however, it is obvious in the case of laninamivir octanoate and zanamivir
because all heavier atoms in zanamivir are included in the obtained
MCS. For the cases of oseltamivir and zanamivir as well as peramivir
and zanamivir, we confirmed that the number of vertices of the obtained
MCSs agrees with the number of vertices of the maximum clique of the
association graphs computed using an API (find_cliques) of NetworkX,[20] a tool for complex network research. In addition,
MCSs of other three pairs of molecules (laninamivir octanoate, oseltamivir),
(laninamivir octanoate, peramivir), and (oseltamivir, peramivir) were
also examined to make the study more exhaustive, and the results were
shown as Figure S1 in the Supporting Information.
Conclusions
We examined the MCSs among
zanamivir, laninamivir octanoate, oseltamivir,
and peramivir. The MCS was determined by finding the maximum clique
of the association graph that were generated from two input chemical
structural formulas combined with the Ising model. The combined label
determined from elemental species and chemical bond significantly
worked well for decreasing the number of vertices in the association
graph, which in turn, reduced the computational cost.The following
point shall be remarked: the present approach finds
the MCS that is not necessary a connected graph but rather a graph
with the maximum number of vertices. This might cause somewhat of
a problem because most tools for the MCS in cheminformatics such as
the RDkit[21] try to find the maximum connected
graph at least in the default setting. Note that the connected part
of the graph in the MCS determined by the present approach does not
necessarily correspond to the maximum connected graph of the association
graph. Thus, it would be desirable to apply the present approach to
the problems that place a premium in deciding the MCS comprising the
disconnected graphs.It is also noteworthy that the MCS among
three or more molecules
can be found by the straightforward extension of the isomorphic mapping
for generating the association graph. However, the approach results
in significant increase of the computational cost: the number of vertices
in the extended association graph of m molecules becomes as many as in the worst case. Consequently, it would
be more practical to extract the MCS of many molecules on a one-by-one
basis instead of considering all molecules at a time by extending
the association graph.
Computational Methods
Association Graph
Given two undirected
graphs G1 = (V1, E1) and G2 = (V2, E2), the association graph G = (V, E) is an undirected graph defined on the vertex
set V = V1 × V2 with two vertices (u1, v1) and (u2, v2) being adjacent whenever
Labeling of Atoms
There are two types
of labels with respect to structural formula: elemental species and
chemical bonds corresponding to vertices and edges, respectively.
It is noteworthy that the label for the edge can be combined into
the label for the vertex if we define the vertex label as elemental
species having specific bond information. This significantly reduces
the number of possible matching of vertices in the construction of
the association graph. We defined the combined label as follows: tens
place and ones place digits correspond to elemental species and bond
information labels, respectively. In fact, the combined label of atom
X (L[X]) was assigned by L[X] = (atomic number of X) × 10 + (bond
order of X). For example, in the case of C≡O, L[C] = 6 ×
10 + 3 (=63) and L[O] = 8 × 10 + 3 (=83). To further reduce the
actual number of vertices and to focus on a basic skeleton of a molecule
formed by elements heavier than H, we ignored H from the structural
formula. We assumed that the atoms in and out of the ring did not
match in the association graph and set the atoms constituting the
ring to have the bond order of 4 for the convenience of the identification.
In other words, we did not distinguish double bonds from single bonds
in the ring.Let us consider a hydrocarbon (CH2) consisting of n carbon atoms with m C=C double bonds. (However,
we ignored =C= to simplify the analysis.) It is noteworthy
that the molecule has (n – 2m) >C< and 2m >C=. If we construct
an association
graph from two CH2 molecules, the number of vertices
in the graph will be n2 for the vertex
label that considers only elements, but (n –
2m)2 + 4m2 for the vertex label consolidates elements and bonds. The ratio
of the latter to the former is significantly smaller than 1 over a
wide range of n for a certain magnitude of m, as shown in Figure . Because quadratic unconstrained binary optimization
(QUBO) is on the scale of the square of the number of vertices and
because the increase of the number of vertices requires more optimization
iterations, the computational complexity can be considered to be substantially
proportional to the third power of the number of vertices or more,
the reduction of the number of vertices is very important for improving
the efficiency of the computation.
Figure 5
Ratio of the number of vertices of the
association graph from two
CH2 molecules. Numerator corresponds
to the number from
the vertex label consolidating elements and bonds while the denominator
corresponds to the number considering only elements.
Ratio of the number of vertices of the
association graph from two
CH2 molecules. Numerator corresponds
to the number from
the vertex label consolidating elements and bonds while the denominator
corresponds to the number considering only elements.
Solving Maximum Clique Problem through Ising
Model
A clique is an induced subgraph of G that forms a complete graph: every two distinct vertices in the
clique are adjacent. A maximum clique of a graph G is a clique such that there is no clique with more vertices. The
NP-complete decision problem of whether or not a clique of the size
K exists can be written as an Ising-like model using a binary bit
variable x (0 or 1).[11] The relevant Hamiltonian (H) for the maximum clique problem is the sum of three partial Hamiltonian
operatorswhereandHere, N is the number
of vertices of graph G, and y is an ancillary binary bit to represent
a clique whose size is n. It is noteworthy that HA becomes zero only when the number of selected
vertices x is n. Similarly, HB becomes zero
only when the selected vertices x form a complete graph. HC means
that it is energetically favorable to include as many vertices as
possible into the clique, although the aggressive inclusion may be
penalized by the terms HA and HB. The constructed Ising Hamiltonian of the
clique problem is a quadratic expression of the binary variables (xv and y), which is consistent with the QUBO form to which various
optimization techniques are applicable. Here, we used Qbsolv (tabu
search) that is an application programing interface supported in D-Wave
Ocean (software development kit developed by D-Wave Systems). We set
the hyperparameters A, B, and C in the Hamiltonian to be A = (Δ+
2) B and B = C as
suggested by Lucas.[11] In particular, we
set Δ = min (|V1|, |V2|) and C = 1 unless otherwise stated.
In actual computation, we used the following binary approach to reduce
the number of ancillary bits from N to M + 1 as explained by Lucas
Authors: Karsten M Borgwardt; Cheng Soon Ong; Stefan Schönauer; S V N Vishwanathan; Alex J Smola; Hans-Peter Kriegel Journal: Bioinformatics Date: 2005-06 Impact factor: 6.937
Authors: M W Johnson; M H S Amin; S Gildert; T Lanting; F Hamze; N Dickson; R Harris; A J Berkley; J Johansson; P Bunyk; E M Chapple; C Enderud; J P Hilton; K Karimi; E Ladizinsky; N Ladizinsky; T Oh; I Perminov; C Rich; M C Thom; E Tolkacheva; C J S Truncik; S Uchaikin; J Wang; B Wilson; G Rose Journal: Nature Date: 2011-05-12 Impact factor: 49.962