Davide Alocci1,2, Julien Mariethoz1, Oliver Horlacher1,2, Jerven T Bolleman3, Matthew P Campbell4, Frederique Lisacek1,2.
Abstract
Resource description framework (RDF) and Property Graph databases are emerging technologies that are used for storing graph-structured data. We compare these technologies through a molecular biology use case: glycan substructure search. Glycans are branched tree-like molecules composed of building blocks linked together by chemical bonds. The molecular structure of a glycan can be encoded into a directed acyclic graph where each node represents a building block and each edge serves as a chemical linkage between two building blocks. In this context, Graph databases are possible software solutions for storing glycan structures, and Graph query languages, such as SPARQL and Cypher, can be used to perform a substructure search. Glycan substructure searching is an important feature for querying structure and experimental glycan databases and retrieving biologically meaningful data. This applies, for example, to identifying a region of the glycan recognised by a glycan binding protein (GBP). In this study, 19,404 glycan structures were selected from GlycomeDB (www.glycome-db.org) and modelled for storage in an RDF triple store and a Property Graph. We then performed two different sets of searches and compared the query response times and the results from both technologies to assess performance and accuracy. The two implementations produced the same results, but interestingly we noted a difference in the query response times. Qualitative measures such as portability were also used to define further criteria for choosing the technology adapted to solving glycan substructure search and other comparable issues.
Year: 2015 PMID: 26656740 PMCID: PMC4684231 DOI: 10.1371/journal.pone.0144578
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Glycan CFG encoding and graph encoding.
On the left-hand side, a glycan structure encoded with CFG nomenclature is presented, while the right-hand side shows the same structure translated into a graph. Each monosaccharide or substituent becomes a node and each glycosidic bond becomes an edge in the graph. To avoid any loss of information, all the properties of each monosaccharide or substituent are converted into node properties, whereas glycosidic bond properties are translated into edge properties. For clarity, the colour code associated with each monosaccharide type is preserved across the images.
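The encoding described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Residue` class, the parent-to-child edge direction, the linkage strings such as `"b1-4"` and the example glycan are all hypothetical, and the naive tree matcher stands in for what a graph database would do with a SPARQL or Cypher query.

```python
from dataclasses import dataclass, field


@dataclass
class Residue:
    """A node: one monosaccharide or substituent (hypothetical model)."""
    name: str
    # Each edge is (linkage, child); the linkage string is the edge property.
    children: list = field(default_factory=list)

    def add(self, linkage, child):
        """Attach `child` through a glycosidic bond and return the child."""
        self.children.append((linkage, child))
        return child


def _assign(node_children, pattern_children, i, used):
    """Backtracking: map each pattern edge to a distinct edge of the node."""
    if i == len(pattern_children):
        return True
    p_link, p_child = pattern_children[i]
    for j, (n_link, n_child) in enumerate(node_children):
        if j in used or n_link != p_link:
            continue
        if matches_at(n_child, p_child):
            used.add(j)
            if _assign(node_children, pattern_children, i + 1, used):
                return True
            used.discard(j)
    return False


def matches_at(node, pattern):
    """True if `pattern` embeds in the glycan with its root placed on `node`."""
    if node.name != pattern.name:
        return False
    return _assign(node.children, pattern.children, 0, set())


def contains(glycan, pattern):
    """Substructure search: try every residue of the glycan as an anchor."""
    stack = [glycan]
    while stack:
        node = stack.pop()
        if matches_at(node, pattern):
            return True
        stack.extend(child for _, child in node.children)
    return False


# Illustrative glycan (assumed for this sketch, not taken from the paper).
glycan = Residue("GlcNAc")
gal = glycan.add("b1-4", Residue("Gal"))
gal.add("a2-3", Residue("Neu5Ac"))
glycan.add("a1-3", Residue("Fuc"))

query = Residue("GlcNAc")
query.add("b1-4", Residue("Gal"))

wrong_linkage = Residue("GlcNAc")
wrong_linkage.add("b1-3", Residue("Gal"))

found = contains(glycan, query)
not_found = contains(glycan, wrong_linkage)
```

Keeping the linkage on the edge rather than on the node mirrors the split between node properties and edge properties described above, so a query only matches when both the residue names and the bond types agree.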
Fig 2. Ontology overview.
Overview of the ontology developed for translating glycan structures into RDF/semantic triples. The figure shows all the predicates and entities used for defining a glycan structure in the RDF triple store.
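As a rough illustration of what translating a glycan into triples can look like, the sketch below emits N-Triples from a small residue and linkage list. The namespace `http://example.org/glycan#` and the predicates `hasResidue`, `name`, `parent`, `child` and `linkage` are invented for this example; they are not the predicates of the ontology shown in the figure.

```python
# Hypothetical namespace and predicates, NOT the ontology from the paper.
EX = "http://example.org/glycan#"


def glycan_to_triples(glycan_id, residues, links):
    """Turn a glycan into (subject, predicate, object) string triples.

    residues: {local_id: residue_name}
    links:    [(parent_id, linkage, child_id)]
    """
    glycan = f"<{EX}{glycan_id}>"
    triples = []
    for rid, name in residues.items():
        residue = f"<{EX}{glycan_id}/res{rid}>"
        triples.append((glycan, f"<{EX}hasResidue>", residue))
        triples.append((residue, f"<{EX}name>", f'"{name}"'))
    # A bond carries its own property (the linkage), so it gets its own node.
    for n, (parent, linkage, child) in enumerate(links):
        link = f"<{EX}{glycan_id}/link{n}>"
        triples.append((link, f"<{EX}parent>", f"<{EX}{glycan_id}/res{parent}>"))
        triples.append((link, f"<{EX}child>", f"<{EX}{glycan_id}/res{child}>"))
        triples.append((link, f"<{EX}linkage>", f'"{linkage}"'))
    return triples


def to_ntriples(triples):
    """Serialise triples one per line in N-Triples syntax."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Illustrative two-residue glycan (assumed data).
triples = glycan_to_triples("G1", {0: "GlcNAc", 1: "Gal"}, [(0, "b1-4", 1)])
document = to_ntriples(triples)
```

RDF predicates cannot carry properties of their own, which is why this sketch represents each glycosidic bond as an intermediate node; a Property Graph can instead store the linkage directly on the edge.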
Fig 3. Query building example.
A. Example of using the RDF model to build a SPARQL query from a glycan substructure, focussing on the translation process. The prefix part of the query is omitted, but further detailed examples are provided in the S1 File. B. The same example shown as a Cypher query, the native query language of Neo4j. Additional examples are likewise provided in the S1 File.
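The translation step from a substructure to query text can be sketched as plain string generation. The vocabulary here is hypothetical: the relationship type `LINKED_TO`, the properties `name` and `linkage`, and the `ex:` prefix are assumptions made for illustration, and the actual queries used in the study are those given in the S1 File. Values are inlined for readability; production code should pass them as query parameters instead.

```python
def build_cypher(residues, links):
    """Build a Cypher query for a substructure (hypothetical schema).

    residues: {index: residue_name}
    links:    [(parent_index, linkage, child_index)]
    """
    patterns = [
        f"(n{p})-[:LINKED_TO {{linkage: '{linkage}'}}]->(n{c})"
        for p, linkage, c in links
    ]
    if not patterns:  # single-residue query: match the bare nodes
        patterns = [f"(n{i})" for i in residues]
    conditions = [f"n{i}.name = '{name}'" for i, name in residues.items()]
    lines = [
        "MATCH " + ", ".join(patterns),
        "WHERE " + " AND ".join(conditions),
        "RETURN DISTINCT n0",
    ]
    return "\n".join(lines)


def build_sparql(residues, links):
    """Build the analogous SPARQL query (hypothetical ex: vocabulary)."""
    body = [f'  ?n{i} ex:name "{name}" .' for i, name in residues.items()]
    for k, (p, linkage, c) in enumerate(links):
        body.append(
            f'  ?l{k} ex:parent ?n{p} ; ex:child ?n{c} ; ex:linkage "{linkage}" .'
        )
    lines = [
        "PREFIX ex: <http://example.org/glycan#>",
        "SELECT DISTINCT ?n0 WHERE {",
        *body,
        "}",
    ]
    return "\n".join(lines)


# Illustrative substructure (assumed data): Gal linked b1-4 to GlcNAc.
residues = {0: "GlcNAc", 1: "Gal"}
links = [(0, "b1-4", 1)]
cypher = build_cypher(residues, links)
sparql = build_sparql(residues, links)
```

Both builders walk the same residue and linkage lists, so one substructure description yields one query per backend, which is what makes a like-for-like timing comparison possible.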
Fig 4. Average query time.
The mean response time of each query in both sets is shown in two bar charts. Panel (A) shows the mean query times for the first set and panel (B) shows the values for the second set. The column assigned to Virtuoso in the second set of queries is empty because no data could be recorded, owing to a problem in running large and very large queries.
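A mean response time of this kind can be computed with a small timing harness. The sketch below is an assumption-laden illustration: the number of repeats, the `run_query` callable and the store names are hypothetical, the sample timings are made up rather than taken from the paper, and the caption does not say how the authors collected their measurements.

```python
import time
from statistics import mean


def time_query(run_query, repeats=5):
    """Run a zero-argument callable several times; return the mean wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return mean(timings)


def mean_per_store(timings_by_store):
    """Mean response time per store; None marks a store with no recorded data."""
    return {
        store: mean(timings) if timings else None
        for store, timings in timings_by_store.items()
    }


# Made-up timings in seconds, for illustration only.
summary = mean_per_store({"store_a": [1.0, 3.0], "store_b": []})
elapsed = time_query(lambda: sum(range(1000)), repeats=3)
```

Returning `None` for a store without measurements keeps the gap visible, in the same way that the empty column does in panel (B).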
Fig 5. 2D structure of query glycoepitopes.
The 2-dimensional structures of three well-known glycoepitopes listed in Table 1 are shown: (A) Lactosamine Type One, (B) Blood Group A and (C) Sialyl Lewis X. The response time for each is shown in Table 1.
Table 1. Comparison of average query response time for three glycoepitopes (see Fig 5).
A comparison of the average query times of the Property Graph and RDF triple store databases tested in this study (columns) for three well-known epitopes (rows). SPARQL and Cypher queries for these glycoepitopes are provided in the S1 File.
| Virtuoso | Neo4j | Sesame | Jena Fuseki | Blazegraph |
|---|---|---|---|---|
| 0.114 s | 0.827 s | 0.814 s | 0.941 s | 0.495 s |
| 0.104 s | 0.945 s | 0.817 s | 0.869 s | 0.225 s |
| 0.469 s | 1.012 s | 0.706 s | 1.970 s | 0.120 s |