graphkernels: R and Python packages for graph comparison.

Mahito Sugiyama, M Elisabetta Ghisu, Felipe Llinares-López, Karsten Borgwardt.

Abstract

Summary: Measuring the similarity of graphs is a fundamental step in the analysis of graph-structured data, which is omnipresent in computational biology. Graph kernels have been proposed as a powerful and efficient approach to this problem of graph comparison. Here we provide graphkernels, the first R and Python graph kernel libraries including baseline kernels such as label histogram based kernels, classic graph kernels such as random walk based kernels, and the state-of-the-art Weisfeiler-Lehman graph kernel. The core of all graph kernels is implemented in C++ for efficiency. Using the kernel matrices computed by the package, we can easily perform tasks such as classification, regression and clustering on graph-structured samples.

Availability and implementation: The R and Python packages including source code are available at https://CRAN.R-project.org/package=graphkernels and https://pypi.python.org/pypi/graphkernels.

Contact: mahito@nii.ac.jp or elisabetta.ghisu@bsse.ethz.ch.

Supplementary information: Supplementary data are available online at Bioinformatics.
© The Author(s) 2017. Published by Oxford University Press.

Year:  2018        PMID: 29028902      PMCID: PMC5860361          DOI: 10.1093/bioinformatics/btx602

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Graph-structured data are growing steadily and are analyzed extensively in computational biology. For example, chemical compounds are modeled as graphs in drug discovery (Takigawa and Mamitsuka, 2013), and proteins are represented as graphs in protein function prediction (Dhifli and Nguif, 2015). Measuring the similarity between a pair of graphs efficiently, known as the graph comparison problem, is a fundamental step in graph analysis and a prerequisite for classification or regression on graph data. There are two approaches to graph comparison: alignment-based methods (Faisal et al., 2015), which compare graphs by finding node mappings, and alignment-free methods (Yaveroğlu et al., 2015), which measure the similarity between graphs using features such as degree distributions or subgraph counts without identifying correspondences between nodes. Among the alignment-free approaches, graph kernels have become a popular way to quantify the similarity between graphs (Borgwardt and Kriegel, 2005; Costa and Grave, 2010; Gärtner et al.; Kashima et al.; Shervashidze et al., 2011; Sugiyama and Borgwardt, 2015; Vishwanathan et al.), and are at the heart of many machine learning approaches in computational biology. Key applications of graph kernels include biological function prediction from graph-based representations of chemical compounds. However, there is no convenient R or Python implementation that can simply and efficiently compute graph kernels, although R is a popular programming environment in bioinformatics and Python in machine learning. Here we present graphkernels, the first package in R and Python with efficient C++ implementations of various graph kernels, including the following prominent kernel families: (i) simple kernels between vertex and/or edge label histograms, (ii) graphlet kernels, (iii) random walk kernels (popular baselines) and (iv) the Weisfeiler-Lehman graph kernel (state of the art).
The packages can be easily used to perform graph classification and regression by machine learning algorithms (Fig. 1), such as support vector machines (SVMs) or the k-nearest neighbors algorithm.
Fig. 1. Overview. The kernel value K_ij represents the similarity between graphs i and j
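As a rough illustration of how a kernel matrix can feed a standard learner such as k-nearest neighbors, the sketch below derives a distance from kernel values via the identity d(i, j)^2 = K_ii + K_jj - 2 K_ij and classifies a query graph by majority vote. The function names and the toy Gram matrix are hypothetical and chosen only for this example; they are not part of the graphkernels packages.

```python
import math
from collections import Counter


def kernel_distance(K, i, j):
    """Distance between samples i and j in the feature space induced by K."""
    # ||phi(i) - phi(j)||^2 = K[i][i] + K[j][j] - 2 * K[i][j];
    # clamp at zero to absorb tiny negative values from rounding.
    return math.sqrt(max(K[i][i] + K[j][j] - 2.0 * K[i][j], 0.0))


def knn_predict(K, labels, train_idx, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    nearest = sorted(train_idx, key=lambda t: kernel_distance(K, query, t))[:k]
    votes = Counter(labels[t] for t in nearest)
    return votes.most_common(1)[0][0]


# Toy Gram matrix for four graphs (made-up values, not computed from MUTAG).
K = [
    [4.0, 3.0, 1.0, 0.0],
    [3.0, 4.0, 0.0, 1.0],
    [1.0, 0.0, 4.0, 3.0],
    [0.0, 1.0, 3.0, 4.0],
]
labels = ["a", "a", "b", "b"]

# Graph 1 is held out and classified from graphs 0, 2 and 3.
prediction = knn_predict(K, labels, train_idx=[0, 2, 3], query=1, k=1)
print(prediction)
```

Because the classifier only reads entries of K, any of the kernels provided by the packages could in principle be substituted without changing this code.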

2 Materials and methods

Each function implemented in the graphkernels packages receives a collection of graphs G_1, G_2, …, G_n and returns the kernel (Gram) matrix of the respective graph kernel, where each kernel value K_ij gives the similarity between graphs G_i and G_j. The packages support the following 14 graph kernels:

  Linear kernels on label histograms: VertexHist, EdgeHist, VertexEdgeHist, VertexVertexEdgeHist.
  Gaussian RBF kernels on label histograms: VertexHistGauss, EdgeHistGauss, VertexEdgeHistGauss.
  Graphlet kernels: Graphlet, ConnectedGraphlet.
  Random walk based kernels: KStepRandomWalk, GeometricRandomWalk, ExponentialRandomWalk, ShortestPath.
  The Weisfeiler-Lehman subtree kernel: WL.

All kernels are implemented in C++ and compiled through the packages Rcpp (Eddelbuettel, 2013) and RcppEigen (Bates and Eddelbuettel, 2013) in the R package. We use SWIG (Beazley, 1996) interfaces to wrap the C++ code in Python. See the Supplementary Material for detailed mathematical definitions of these graph kernels.

In our packages, each graph is treated as an igraph object (Csardi and Nepusz, 2006) and a collection of graphs is kept as a list of igraph graphs. An example usage in R is shown in the following, where we use the dataset MUTAG (Debnath et al., 1991), a typical benchmark dataset that is also provided in our package.

    > library(graphkernels)  # load the package
    > data(mutag)            # load a sample dataset,
                             ## which is a list of (igraph) graphs
    > mutag[[1]]
    IGRAPH f2f3caf U--- 23 27 --
    + attr: label (g/n), label (v/n), label (e/n)
    + edges:
    [1] 1-- 2 1--14 2-- 3…
    ## the first graph has 23 nodes and 27 edges
    > K <- CalculateVertexHistKernel(mutag)
    ## compute the kernel matrix
    > K[1, 2]
    [1] 282
    ## the kernel value between graphs 1 and 2

The entire kernel matrix can be easily computed by a single line of R code. Similar examples in Python can be found in the Supplementary Material.
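To make two of the kernel families listed above more concrete, the following pure-Python sketch computes a linear kernel on vertex label histograms and a Weisfeiler-Lehman subtree kernel in its commonly described form (iterated neighborhood relabeling followed by histogram inner products). It is an illustration only: it does not call the graphkernels Python package, the dictionary-based graph representation and all function names are assumptions made for this example, and the exact definitions used by the packages (for instance, any normalization) are those in the Supplementary Material.

```python
from collections import Counter


def dot(h1, h2):
    """Inner product of two sparse histograms stored as Counters."""
    return sum(count * h2[key] for key, count in h1.items())


def vertex_hist_kernel(g1, g2):
    """Linear kernel on vertex label histograms of two graphs."""
    return dot(Counter(g1["labels"]), Counter(g2["labels"]))


def adjacency(graph):
    """Neighbor lists of an undirected graph given as an edge list."""
    nbrs = [[] for _ in graph["labels"]]
    for u, v in graph["edges"]:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return nbrs


def wl_subtree_kernel(graphs, h=2):
    """Gram matrix of a WL subtree kernel with h relabeling rounds."""
    n = len(graphs)
    nbrs = [adjacency(g) for g in graphs]
    labels = [list(g["labels"]) for g in graphs]
    K = [[0] * n for _ in range(n)]
    for iteration in range(h + 1):
        # Add the label-histogram inner products of the current round.
        hists = [Counter(graph_labels) for graph_labels in labels]
        for i in range(n):
            for j in range(n):
                K[i][j] += dot(hists[i], hists[j])
        if iteration == h:
            break
        # Relabel every node by (own label, sorted neighbor labels), using
        # one dictionary shared by all graphs so labels stay comparable.
        compress = {}
        new_labels = []
        for graph_labels, graph_nbrs in zip(labels, nbrs):
            relabeled = []
            for v, lab in enumerate(graph_labels):
                signature = (lab, tuple(sorted(graph_labels[u] for u in graph_nbrs[v])))
                relabeled.append(compress.setdefault(signature, len(compress)))
            new_labels.append(relabeled)
        labels = new_labels
    return K


# Two toy labeled paths (hypothetical data): C-C-O and C-O-C.
g1 = {"labels": ["C", "C", "O"], "edges": [(0, 1), (1, 2)]}
g2 = {"labels": ["C", "O", "C"], "edges": [(0, 1), (1, 2)]}

k_hist = vertex_hist_kernel(g1, g2)
K_wl = wl_subtree_kernel([g1, g2], h=1)
print(k_hist, K_wl)
```

The two toy graphs have identical label histograms, so the histogram kernel cannot tell them apart, whereas one round of relabeling already separates them because their neighborhoods differ.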

3 Application

As a representative application, we demonstrate graph classification using the MUTAG dataset. The dataset contains 188 graphs, and the objective is to predict the label of each graph, which indicates whether or not the compound is mutagenic. We used 10-fold cross validation for graph classification. We randomly divided the entire dataset into 10 folds. In each iteration, one of the 10 folds was used for testing and the rest for training. We computed the kernel matrix of the training data using one of our functions implemented in graphkernels, and used it to train an SVM using the kernlab package (Karatzoglou et al.). We then predicted labels on the test data, and obtained the accuracy by comparison with the ground-truth labels. The detailed experimental methodology and the R code to reproduce these results are provided in the Supplementary Material. Figure 2 shows the prediction accuracy for the graph kernels in our package and the CPU running time needed to compute each kernel matrix. This example demonstrates that our package allows for an easy comparison of the effectiveness and the efficiency of various popular graph kernels and will serve as a baseline when designing new graph kernels for specialized applications in computational biology.
Fig. 2. Accuracy (left) and running time (in seconds, right) on the MUTAG dataset
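The cross-validation protocol described above can be sketched in a few lines. The example below is not the authors' R code: it assumes the full Gram matrix has been computed once and is indexed per fold, and it swaps the SVM for a 1-nearest-neighbor rule in kernel space so that the example needs only the Python standard library. The helper names and the toy data are hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_validate(K, y, k=10, seed=0):
    """k-fold accuracy of a 1-nearest-neighbor rule on a precomputed kernel."""
    n = len(y)
    correct = 0
    for test in k_fold_indices(n, k, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        for q in test:
            # Squared kernel distance: K_qq + K_tt - 2 * K_qt.
            nearest = min(train, key=lambda t: K[q][q] + K[t][t] - 2.0 * K[q][t])
            correct += int(y[nearest] == y[q])
    return correct / n


# Toy stand-in for a graph kernel matrix: a linear kernel on 2-D points
# forming two well-separated groups (made-up data, not MUTAG).
points = [(1.0 + 0.01 * i, 0.0) for i in range(10)]
points += [(0.0, 1.0 + 0.01 * i) for i in range(10)]
y = ["pos"] * 10 + ["neg"] * 10
K = [[a[0] * b[0] + a[1] * b[1] for b in points] for a in points]

accuracy = cross_validate(K, y, k=10, seed=0)
print(accuracy)
```

With a real kernel matrix, K would be replaced by the output of one of the package functions and the nearest-neighbor step by an SVM that accepts a precomputed kernel.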

Funding

MS was funded by JSPS KAKENHI Grant Numbers JP16K16115 and JP16H02870. MEG was funded by Horizon 2020 project CDS-QUAMRI, Grant No. 634541.

References

1.  A K Debnath; R L Lopez de Compadre; G Debnath; A J Shusterman; C Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem, 1991-02.

2.  Ichigaku Takigawa; Hiroshi Mamitsuka. Graph mining: procedure, application to drug discovery and recent advances. Drug Discov Today, 2012-08-05.

3.  Ömer Nebil Yaveroğlu; Tijana Milenković; Nataša Pržulj. Proper evaluation of alignment-free network comparison methods. Bioinformatics, 2015-03-24.

4.  Fazle E Faisal; Lei Meng; Joseph Crawford; Tijana Milenković. The post-genomic era of biological network alignment. EURASIP J Bioinform Syst Biol, 2015-06-04.