
Are graph databases ready for bioinformatics?

Christian Theil Have, Lars Juhl Jensen.

Year:  2013        PMID: 24135261      PMCID: PMC3842757          DOI: 10.1093/bioinformatics/btt549

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


Graphs are ubiquitous in bioinformatics and frequently consist of too many nodes and edges to hold in random access memory. These graphs are thus stored in databases to allow for efficient queries using declarative query languages such as Structured Query Language (SQL). Traditional relational databases (e.g. MySQL and PostgreSQL) have long been used for this purpose and build on decades of research into query optimization. Recently, NoSQL databases have attracted a lot of attention because of their advantages in scalability. The term NoSQL refers to schemaless databases such as key/value stores (e.g. Apache Cassandra), document stores (e.g. MongoDB) and graph databases (e.g. AllegroGraph, Neo4j, OpenLink Virtuoso), which do not fit within the traditional relational paradigm. Most NoSQL databases do not have a declarative query language. The widely used Neo4j graph database is an exception (Webber). Its query language, Cypher, is designed for expressing graph queries but is still evolving.
Graph databases have so far seen only limited use within bioinformatics (Schriml). To illustrate the pros and cons of using a graph database (exemplified by Neo4j v1.8.1) instead of a relational database (PostgreSQL v9.1), we imported the human interaction network from STRING v9.05 (Franceschini) into both. This is an approximately scale-free network with 20 140 proteins and 2.2 million interactions. Like all graph databases, Neo4j stores edges as direct pointers between nodes, which can thus be traversed in constant time. Because Neo4j uses the property graph model, nodes and edges can have properties associated with them; we use this to store the protein names and the confidence scores associated with the interactions (Fig. 1). In PostgreSQL, we stored the graph as an indexed table of node pairs, which can be traversed with either logarithmic or constant lookup complexity, depending on the type of index used.
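The two storage strategies described above can be contrasted in a minimal Python sketch. It is not the benchmark setup: an in-memory SQLite table stands in for the PostgreSQL table of node pairs, a dictionary of dictionaries stands in for the pointer-based property graph, and the protein names, scores, table name and column names are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Toy interactions (hypothetical proteins and confidence scores).
EDGES = [("A", "B", 0.9), ("A", "C", 0.4), ("B", "C", 0.7), ("C", "D", 0.8)]

# Relational style: one row per directed node pair plus an index on the
# source column. SQLite's B-tree index gives logarithmic lookups; it only
# stands in for the indexed PostgreSQL table here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (protein1 TEXT, protein2 TEXT, score REAL)")
conn.execute("CREATE INDEX idx_protein1 ON interactions (protein1)")
for a, b, s in EDGES:
    # Store both directions so an interaction is found from either end.
    conn.executemany("INSERT INTO interactions VALUES (?, ?, ?)",
                     [(a, b, s), (b, a, s)])


def neighbors_sql(protein):
    """Look up neighbors and scores through the indexed table."""
    rows = conn.execute(
        "SELECT protein2, score FROM interactions WHERE protein1 = ?",
        (protein,))
    return dict(rows)


# Graph style: every node holds direct references to its neighbors, and the
# edge property (the confidence score) is stored on the edge itself.
adjacency = defaultdict(dict)
for a, b, s in EDGES:
    adjacency[a][b] = s
    adjacency[b][a] = s


def neighbors_graph(protein):
    """Follow the stored references; no index lookup is involved."""
    return dict(adjacency[protein])
```

Both functions return the same neighbors; the difference lies in whether each traversal step goes through an index or follows a stored reference.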
On these databases, we benchmarked the speed of Cypher and SQL queries for solving three bioinformatics graph processing problems: finding the immediate neighbors of a protein and their interactions, finding the best-scoring path between two proteins and finding the shortest path between them. We selected these three tasks because they illustrate the strengths and weaknesses of graph databases compared with traditional relational databases.
Fig. 1.

Relational versus graph database representation of a small protein interaction network. In the relational database, the network is stored as an interactions table (left). By contrast, a graph database directly stores interactions as pointers between protein nodes (right). Below, we show the queries to identify second-order interaction partners in SQL and Cypher, respectively.
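Because the figure itself is not reproduced here, the following Python sketch illustrates what a second-order partner query might look like. The SQL self-join runs against an in-memory SQLite table with hypothetical column names and toy data. The Cypher string uses present-day Cypher syntax with an assumed `Protein` label and `INTERACTS` relationship type; it is not executed and is not necessarily the query shown in the original figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (protein1 TEXT, protein2 TEXT, score REAL)")
# Hypothetical toy network, stored in both directions.
for a, b, s in [("A", "B", 0.9), ("B", "C", 0.7), ("C", "D", 0.8), ("A", "E", 0.5)]:
    conn.executemany("INSERT INTO interactions VALUES (?, ?, ?)",
                     [(a, b, s), (b, a, s)])

# SQL: join the interactions table with itself to take two steps.
SECOND_ORDER_SQL = """
SELECT DISTINCT i2.protein2
FROM interactions AS i1
JOIN interactions AS i2 ON i1.protein2 = i2.protein1
WHERE i1.protein1 = :name AND i2.protein2 <> :name
ORDER BY i2.protein2
"""

# Cypher: the same two steps written as a path pattern (illustrative only;
# label and relationship type are assumptions, and the string is never run).
SECOND_ORDER_CYPHER = """
MATCH (a:Protein {name: $name})-[:INTERACTS]-()-[:INTERACTS]-(b:Protein)
WHERE b <> a
RETURN DISTINCT b.name
"""


def second_order_partners(name):
    """Return proteins reachable in exactly two interaction steps."""
    return [row[0] for row in conn.execute(SECOND_ORDER_SQL, {"name": name})]
```

The path pattern in Cypher replaces the explicit self-join, which is the conciseness the caption alludes to.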

A common task in STRING is to retrieve a neighbor network. This involves finding the immediate neighbors of a protein and all interactions between them. Expressing this as a single SQL query requires query nesting and a UNION set operation. Because Cypher currently supports neither of these features, two queries are needed to solve the task: one to find the immediate neighbors and a second to find their interactions, which must be run for each of the immediate neighbors. Although this precludes some query optimizations, running all these Cypher queries is 36× faster than running the single SQL query (Table 1). However, it should be noted that a 49-fold speedup is attainable with PostgreSQL by similarly decomposing the complex query into multiple simple SQL queries. In theory, posing the task as one declarative query maximizes the opportunity for query optimization, but in practice this does not always give good performance. These results also show that even for graph data, using a graph database is not necessarily an advantage.
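The decomposition into simple queries can be sketched as follows, using SQLite from Python as a stand-in for either database. The schema and data are hypothetical, and the sketch assumes that the neighbor network includes the query protein itself, which the text does not state explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (protein1 TEXT, protein2 TEXT, score REAL)")
conn.execute("CREATE INDEX idx_protein1 ON interactions (protein1)")
# Hypothetical toy network, stored in both directions.
for a, b, s in [("A", "B", 0.9), ("A", "C", 0.4), ("B", "C", 0.7),
                ("C", "D", 0.8), ("D", "E", 0.6)]:
    conn.executemany("INSERT INTO interactions VALUES (?, ?, ?)",
                     [(a, b, s), (b, a, s)])

PARTNERS_SQL = "SELECT protein2, score FROM interactions WHERE protein1 = ?"


def neighbor_network(protein):
    """Collect the neighbor network with one simple query per neighbor."""
    # Query 1: immediate neighbors of the query protein.
    neighbors = [row[0] for row in conn.execute(PARTNERS_SQL, (protein,))]
    # Assumption: the query protein is part of its own neighbor network.
    members = set(neighbors) | {protein}
    network = set()
    # Query 2, repeated per neighbor: keep interactions that stay inside
    # the member set; the filtering happens on the client side.
    for neighbor in neighbors:
        for partner, score in conn.execute(PARTNERS_SQL, (neighbor,)):
            if partner in members:
                edge = (min(neighbor, partner), max(neighbor, partner), score)
                network.add(edge)
    return sorted(network)
```

Each individual query is a plain indexed lookup, so little is left for the query planner to get wrong; the price is one round trip per neighbor.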
Table 1.

Query benchmark of a relational and a graph database

              Neighbor network   Best-scoring path   Shortest path
PostgreSQL    206.31 s           1147.74 s           976.22 s
Neo4j         5.68 s (a)         1.17 s              0.40 s
Speedup       36×                981×                2441×

Note: For each of the three selected tasks, we ran the corresponding queries for randomly selected human proteins/protein pairs and report the average time. We used a Linux machine equipped with a 3 GHz quad-core Intel Core i3 processor, 4 GB random access memory and a 250 GB 7200 rpm hard drive.

(a) Neighbor networks cannot be expressed as a single Cypher query. Instead, we report the total time of all queries involved in solving this task. A similar speedup was observed for PostgreSQL when decomposing the complex query into multiple simple queries in the same way.

Finding the best-scoring path in a weighted graph is another frequently occurring task. For example, finding the best-scoring path connecting two proteins in the STRING network is a crucial part of the NetworKIN algorithm (Linding). This task can be expressed as a single query both in (recursive) SQL and in Cypher. However, in practice neither query can be executed unless the maximal path length is severely constrained, in which case the Cypher query was 981× faster (Table 1). The poor scalability is due to an exponential explosion in the number of longer paths, which is in part a consequence of the scale-free nature of the network. The task can be solved efficiently using Dijkstra's algorithm, but neither database is capable of casting queries as dynamic programming problems, although promising results have been achieved with automatic dynamic programming in declarative languages (Zhou).
By contrast, the Cypher graph query language has a dedicated function for finding shortest paths that does not take edge weights into account. This leads to a massive speed improvement for this specific task: Neo4j is able to find the shortest path with no length constraint 2441× faster than PostgreSQL can find the shortest path when the maximal path length is constrained to two edges. This shows what is possible when efficient algorithms are tightly integrated with graph databases.
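For comparison with the declarative queries, both path problems can be solved outside the database with standard algorithms. The Python sketch below is illustrative only and makes two assumptions that the text does not state: the score of a path is taken to be the product of its edge confidence scores, and scores lie in the interval (0, 1]. Under these assumptions, maximizing the product equals minimizing the sum of negative log scores, which Dijkstra's algorithm handles. The unweighted shortest path is found with breadth-first search. The toy graph is hypothetical.

```python
import heapq
import math
from collections import deque

# Hypothetical undirected toy network: node -> {neighbor: confidence score}.
GRAPH = {
    "A": {"B": 0.9, "C": 0.4},
    "B": {"A": 0.9, "C": 0.7, "D": 0.5},
    "C": {"A": 0.4, "B": 0.7, "D": 0.8},
    "D": {"B": 0.5, "C": 0.8},
}


def best_scoring_path(graph, source, target):
    """Dijkstra on -log(score); returns (path, product of scores)."""
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path, math.exp(-cost)
        if node in settled:
            continue
        settled.add(node)
        for neighbor, score in graph[node].items():
            if neighbor not in settled:
                # Assumes 0 < score <= 1, so every edge cost is non-negative.
                heapq.heappush(
                    queue, (cost - math.log(score), neighbor, path + [neighbor]))
    return None, 0.0


def shortest_path(graph, source, target):
    """Breadth-first search; edge weights are ignored."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None
```

On this toy graph the two notions of "best" differ: the best-scoring path from A to D takes three edges, whereas the shortest path takes two.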
In summary, graph databases themselves are ready for bioinformatics and can offer great speedups over relational databases on selected problems. The fact that a certain dataset is a graph, however, does not necessarily imply that a graph database is the best choice; it depends on the exact types of queries that need to be performed. Graph queries formulated in terms of paths can be concise and intuitive compared with equivalent SQL queries complicated by joins. Nevertheless, declarative graph query languages leave much to be desired, both feature-wise and performance-wise. Relational databases are a better choice when set operations are needed. Such operations are not as natural a fit to graph databases and have yet to make it into declarative graph database query languages. These languages are efficient for basic path traversal problems, but to realize the full benefits of using a graph database, it is presently necessary to tightly integrate the relevant algorithms with the graph database.
Conflict of Interest: none declared.
  3 in total

1.  Systematic discovery of in vivo phosphorylation networks.

Authors:  Rune Linding; Lars Juhl Jensen; Gerard J Ostheimer; Marcel A T M van Vugt; Claus Jørgensen; Ioana M Miron; Francesca Diella; Karen Colwill; Lorne Taylor; Kelly Elder; Pavel Metalnikov; Vivian Nguyen; Adrian Pasculescu; Jing Jin; Jin Gyoon Park; Leona D Samson; James R Woodgett; Robert B Russell; Peer Bork; Michael B Yaffe; Tony Pawson
Journal:  Cell       Date:  2007-06-14       Impact factor: 41.582

2.  Disease Ontology: a backbone for disease semantic integration.

Authors:  Lynn Marie Schriml; Cesar Arze; Suvarna Nadendla; Yu-Wei Wayne Chang; Mark Mazaitis; Victor Felix; Gang Feng; Warren Alden Kibbe
Journal:  Nucleic Acids Res       Date:  2011-11-12       Impact factor: 16.971

3.  STRING v9.1: protein-protein interaction networks, with increased coverage and integration.

Authors:  Andrea Franceschini; Damian Szklarczyk; Sune Frankild; Michael Kuhn; Milan Simonovic; Alexander Roth; Jianyi Lin; Pablo Minguez; Peer Bork; Christian von Mering; Lars J Jensen
Journal:  Nucleic Acids Res       Date:  2012-11-29       Impact factor: 16.971

  15 in total

1.  Graphery: interactive tutorials for biological network algorithms.

Authors:  Heyuan Zeng; Jinbiao Zhang; Gabriel A Preising; Tobias Rubel; Pramesh Singh; Anna Ritz
Journal:  Nucleic Acids Res       Date:  2021-07-02       Impact factor: 16.971

2.  biochem4j: Integrated and extensible biochemical knowledge through graph databases.

Authors:  Neil Swainston; Riza Batista-Navarro; Pablo Carbonell; Paul D Dobson; Mark Dunstan; Adrian J Jervis; Maria Vinaixa; Alan R Williams; Sophia Ananiadou; Jean-Loup Faulon; Pedro Mendes; Douglas B Kell; Nigel S Scrutton; Rainer Breitling
Journal:  PLoS One       Date:  2017-07-14       Impact factor: 3.240

Review 3.  Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies.

Authors:  Alexandre G de Brevern; Jean-Philippe Meyniel; Cécile Fairhead; Cécile Neuvéglise; Alain Malpertuy
Journal:  Biomed Res Int       Date:  2015-06-01       Impact factor: 3.411

4.  Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction.

Authors:  Marco Mesiti; Matteo Re; Giorgio Valentini
Journal:  Gigascience       Date:  2014-04-23       Impact factor: 6.524

5.  STON: exploring biological pathways using the SBGN standard and graph databases.

Authors:  Vasundra Touré; Alexander Mazein; Dagmar Waltemath; Irina Balaur; Mansoor Saqi; Ron Henkel; Johann Pellet; Charles Auffray
Journal:  BMC Bioinformatics       Date:  2016-12-05       Impact factor: 3.169

6.  GeNNet: an integrated platform for unifying scientific workflows and graph databases for transcriptome data analysis.

Authors:  Raquel L Costa; Luiz Gadelha; Marcelo Ribeiro-Alves; Fábio Porto
Journal:  PeerJ       Date:  2017-07-05       Impact factor: 2.984

7.  Systematic integration of biomedical knowledge prioritizes drugs for repurposing.

Authors:  Daniel Scott Himmelstein; Antoine Lizee; Christine Hessler; Leo Brueggeman; Sabrina L Chen; Dexter Hadley; Ari Green; Pouya Khankhanian; Sergio E Baranzini
Journal:  Elife       Date:  2017-09-22       Impact factor: 8.140

8.  GAIL: An interactive webserver for inference and dynamic visualization of gene-gene associations based on gene ontology guided mining of biomedical literature.

Authors:  Daniel Couch; Zhenning Yu; Jin Hyun Nam; Carter Allen; Paula S Ramos; Willian A da Silveira; Kelly J Hunt; Edward S Hazard; Gary Hardiman; Andrew Lawson; Dongjun Chung
Journal:  PLoS One       Date:  2019-07-01       Impact factor: 3.240

9.  Reactome graph database: Efficient access to complex pathway data.

Authors:  Antonio Fabregat; Florian Korninger; Guilherme Viteri; Konstantinos Sidiropoulos; Pablo Marin-Garcia; Peipei Ping; Guanming Wu; Lincoln Stein; Peter D'Eustachio; Henning Hermjakob
Journal:  PLoS Comput Biol       Date:  2018-01-29       Impact factor: 4.475

10.  A Survey of Systematic Evidence Mapping Practice and the Case for Knowledge Graphs in Environmental Health and Toxicology.

Authors:  Taylor A M Wolffe; John Vidler; Crispin Halsall; Neil Hunt; Paul Whaley
Journal:  Toxicol Sci       Date:  2020-05-01       Impact factor: 4.849

