G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases.

Xiaohong Wang, Aaron Smalter, Jun Huan, Gerald H Lushington.

Abstract

Structured data, including sets, sequences, trees, and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains, including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents.

Most current graph indexing methods focus on subgraph query processing, i.e., determining the set of database graphs that contain the query graph, and hence do not directly support similarity search. In data mining and machine learning, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models for supervised learning, graph kernel functions (i) have high computational complexity and (ii) are difficult to index in a graph database.

Our objective is to bridge graph kernel functions and similarity search in graph databases by proposing (i) a novel kernel-based similarity measurement and (ii) an efficient indexing structure for graph data management. Our similarity measurement builds upon local features extracted from each node and its neighboring nodes in a graph. A hash table supports efficient storage and fast search of the extracted local features. Using the hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and to support fast similarity query processing. We have implemented our method, which we have named G-hash, and have demonstrated its utility on large chemical graph databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Most importantly, the new similarity measurement and index structure scale to large databases, with smaller index size, faster index construction time, and faster query processing time than state-of-the-art indexing methods such as C-tree, gIndex, and GraphGrep.
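
The general idea of hashing local node features and scoring graphs through the hash table can be illustrated with a minimal Python sketch. This is not the G-hash implementation: the abstract does not specify the feature set or the kernel, so the sketch assumes a feature made of a node's label plus the sorted labels of its neighbors, and it swaps in a simple exact-match count over node pairs as the kernel. The names `node_features`, `build_index`, and `query`, and the toy molecule graphs, are hypothetical.

```python
from collections import defaultdict


def node_features(labels, edges):
    """Map each node to a hashable local feature (assumed form):
    (own label, sorted tuple of neighbor labels)."""
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return {
        node: (label, tuple(sorted(labels[m] for m in neighbors[node])))
        for node, label in labels.items()
    }


def build_index(database):
    """Build a hash table: feature -> {graph id: number of nodes with that feature}."""
    index = defaultdict(lambda: defaultdict(int))
    for graph_id, (labels, edges) in database.items():
        for feature in node_features(labels, edges).values():
            index[feature][graph_id] += 1
    return index


def query(index, labels, edges, k=1):
    """Return the k highest-scoring database graphs for a query graph.

    The score counts node pairs (one from each graph) whose features match
    exactly, which is a stand-in for the kernel, not the paper's definition.
    """
    query_counts = defaultdict(int)
    for feature in node_features(labels, edges).values():
        query_counts[feature] += 1

    scores = defaultdict(int)
    for feature, q_count in query_counts.items():
        # Only graphs sharing at least one feature with the query are touched.
        for graph_id, g_count in index.get(feature, {}).items():
            scores[graph_id] += q_count * g_count

    # Sort by descending score, breaking ties by graph id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy database of heavy-atom skeletons (hypothetical data): node labels and edges.
database = {
    "ethanol": ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    "methanol": ({0: "C", 1: "O"}, [(0, 1)]),
    "propane": ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2)]),
}
index = build_index(database)
top_two = query(index, *database["ethanol"], k=2)
```

Because the query only visits hash buckets for features it actually contains, graphs with no shared local feature are never scored, which is the property that makes a hash-table index attractive for k-NN search in this setting.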

Year:  2009        PMID: 20428322      PMCID: PMC2860326          DOI: 10.1145/1516360.1516416

Source DB:  PubMed          Journal:  Adv Database Technol


