Shelton D Griffith, Daniel J Quest, Thomas S Brettin, Robert W Cottingham.
Abstract
BACKGROUND: Biology is rapidly becoming a data-intensive, data-driven science. It is essential that data be represented and connected in ways that best capture its full conceptual content and allow both automated integration and data-driven decision-making. Recent advancements in distributed multi-relational directed graphs, implemented in the form of the Semantic Web, make it possible to deal with complicated heterogeneous data in new and interesting ways.
Year: 2011 PMID: 22165854 PMCID: PMC3236839 DOI: 10.1186/1471-2105-12-S10-S17
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A modular view of the BioSITES architecture. Subsystems (Detectors, Data Stores, Content Delivery, and the Kernel) are all required to realize systems for monitoring streaming data. The BioSITES Kernel is at the heart of the architecture, providing monitoring capabilities. Each other subsystem requires a functioning kernel to run. Subsystems interact with the BioSITES kernel through clearly defined interfaces.
Figure 2. A runtime view of the BioSITES system. BioSITES is achieved by connecting software components (the circles in the figure) from multiple institutions. These software components are connected via the message-oriented middleware (MOM), ActiveMQ. MOM uses the publish-subscribe pattern to connect software components. Data is forwarded from one component to another when the second component subscribes to the first. BioSITES components include Detectors (e.g. DNA sequencers and web crawlers), Routers (components that route data appropriately in the system based on the metadata), Sensors (analysis routines or workflows that identify and filter important signals in the data stream), Controllers (extensions of the Semantic Web that contain decision logic to process and integrate workflow outputs), and Advisories (components that contain reporting capabilities for displaying and relaying information). The double lines in the figure illustrate network boundaries. Data storage and access in the system is constrained by these network boundaries. Each component of the system is aware of all upstream components through messages that flow through the system.
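The publish-subscribe wiring described in this caption can be sketched in a few lines. This is a minimal illustration only: `InMemoryBroker` is a toy stand-in and not the ActiveMQ API, and the topic names, message fields, and the `"NDM-1"` filter string are assumptions made for the example. Each component appends its own name to a `path` field, which is one simple way for downstream components to know every upstream component.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Message = Dict[str, object]
Handler = Callable[[Message], None]


class InMemoryBroker:
    """Toy stand-in for a message broker (hypothetical; not the ActiveMQ API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Message) -> None:
        # Deliver the message to every handler subscribed to this topic.
        for handler in list(self._subscribers[topic]):
            handler(message)


def forward(broker: InMemoryBroker, component: str, topic: str, message: Message) -> None:
    """Publish a copy of the message with this component appended to its upstream path."""
    path = list(message.get("path", [])) + [component]
    broker.publish(topic, {**message, "path": path})


def run_demo() -> List[Message]:
    broker = InMemoryBroker()
    advisories: List[Message] = []

    def router(msg: Message) -> None:
        # Router: route on metadata; only sequence data goes to the sensor.
        if msg.get("kind") == "sequence":
            forward(broker, "router", "sensor.in", msg)

    def sensor(msg: Message) -> None:
        # Sensor: filter the stream, keeping only payloads with the marker.
        if "NDM-1" in str(msg.get("payload", "")):
            forward(broker, "sensor", "advisory.in", msg)

    broker.subscribe("detector.out", router)
    broker.subscribe("sensor.in", sensor)
    broker.subscribe("advisory.in", advisories.append)

    # Detector: the source of the stream (three hypothetical messages).
    forward(broker, "detector", "detector.out", {"kind": "sequence", "payload": "hit: NDM-1"})
    forward(broker, "detector", "detector.out", {"kind": "sequence", "payload": "no hit"})
    forward(broker, "detector", "detector.out", {"kind": "article", "payload": "NDM-1 report"})
    return advisories
```

Calling `run_demo()` returns the messages that reached the advisory stage. In a real deployment the broker would be a networked service and the components separate processes, possibly at different institutions.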
Figure 3. A conceptual view of where data is stored in the BioSITES data store. Data in streams is analysed in near real time and is not stored. RDF Statements found on the Semantic Web are each stored in repositories owned by each institution. Portions of these repositories may also be cached in the Scenario RDF stores to improve performance. RDF Statements created by analysis are stored in RDF Stores distributed across the BioSITES system, with each store corresponding to one or a few scenarios. Catalogs required by a scenario (e.g. a set of sequences) must be converted to RDF, and then made available for traversal by software.
Figure 4. A simplified view of the results from the SDDM process. Five different ontologies are used in the construction of this multi-relationship graph: the Scenario Ontology (abbreviated with a colon, :), Bio2RDF (abbreviated with Bio2RDF_X where X is a namespace tag e.g. if X = PubMed, then Bio2RDF_PubMed = http://bio2rdf.org/pubmed), GeoNames, ProMed, and DBPedia. The SDDM process created all edges in the example, and all nodes in black and white. Bio2RDF (Blue), GeoNames (Green), and DBPedia (Purple) nodes already existed on the Semantic Web. Red nodes representing ProMed mail articles are integrated using an RDFizer as part of the SDDM process. The NDM-1 example created RDF individuals corresponding to Wikipedia articles, InterProScan models, PubMed articles, ProMed articles, Gene Ontology terms, a scenario, catalogs, events, outbreaks, NCBI sequences (gi numbers), and GeoName locations. The nodes and edges diagrammed in this figure are physically stored in 5 locations: Bio2RDF nodes marked in blue are stored on the main Bio2RDF server or an acceptable mirror, GeoNames nodes marked in green are stored on the GeoNames infrastructure, DBPedia nodes marked in purple are stored on the DBPedia servers or a mirror, ProMed nodes marked in red are stored in a local graph database (Neo4j), and nodes and relations identified in the scenario, marked in black, are stored in a local RDF database on the machine where the scenario is built.
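The Bio2RDF_X abbreviation in this caption can be expanded mechanically. The sketch below assumes the namespace tag is lowercased when appended to the base URI; that rule is inferred from the single PubMed example given here and may not hold for every namespace. The function name `expand_bio2rdf` is hypothetical.

```python
BIO2RDF_BASE = "http://bio2rdf.org/"
PREFIX = "Bio2RDF_"


def expand_bio2rdf(abbrev: str) -> str:
    """Expand an abbreviation such as 'Bio2RDF_PubMed' into a namespace URI.

    Assumption: the tag after 'Bio2RDF_' is lowercased, as in the
    caption's PubMed example.
    """
    if not abbrev.startswith(PREFIX) or len(abbrev) == len(PREFIX):
        raise ValueError(f"not a Bio2RDF_X abbreviation: {abbrev!r}")
    tag = abbrev[len(PREFIX):]
    return BIO2RDF_BASE + tag.lower()
```

For the caption's own example, `expand_bio2rdf("Bio2RDF_PubMed")` yields `http://bio2rdf.org/pubmed`.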
Figure 5. A multi-relational, multi-ontology directed graph representing the output from an analytical routine (a BioSITES sensor). RDF Statements represent the results from an analysis (each blue edge is a statement). Statements are also made between entities of the multi-relational directed graph, Literals (blue squares), and ontologies (Scenario, Bio2RDF, and GeoNames). Ontologies (dashed rectangles) also contain relationships between classes (rounded rectangles). An analysis is responsible for linking Bio2RDF entities to samples. Samples correspond to events, and events have locations.
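The chain described in this caption (entity to sample, sample to event, event to location) amounts to a path traversal over subject-predicate-object statements. The sketch below uses plain tuples rather than an RDF library, and every identifier and predicate name in it (`:foundIn`, `:correspondsTo`, `:hasLocation`, `bio2rdf:entity_1`, and so on) is a hypothetical placeholder, not a term from the Scenario ontology.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical statements mirroring the caption's chain of relationships.
TRIPLES: List[Triple] = [
    ("bio2rdf:entity_1", ":foundIn", ":sample_1"),
    (":sample_1", ":correspondsTo", ":event_1"),
    (":event_1", ":hasLocation", "geonames:place_1"),
    (":event_1", ":hasLabel", '"example event"'),  # a literal-valued statement
]


def objects(triples: Sequence[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object o such that (subject, predicate, o) is a statement."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow(triples: Sequence[Triple], start: str, predicates: Sequence[str]) -> List[str]:
    """Follow a fixed sequence of predicates outward from a start node."""
    frontier = [start]
    for predicate in predicates:
        frontier = [o for node in frontier for o in objects(triples, node, predicate)]
    return frontier


def locations_for_entity(entity: str) -> List[str]:
    """Entity -> sample -> event -> location, as described in the caption."""
    return follow(TRIPLES, entity, [":foundIn", ":correspondsTo", ":hasLocation"])
```

A graph traversal language or a SPARQL query would express the same path more compactly against a real store; the tuple version only shows the shape of the traversal.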
Namespaces and abbreviations used in the text.
| Name | URI | Description | Reference |
|---|---|---|---|
| DBpedia | | Ontology of Wikipedia entries | DBpedia: A Nucleus for a Web of Open Data |
| SO [sequence ontology] | | Ontology of genetic sequences | The Sequence Ontology: a tool for the unification of genome annotations |
| GO [gene ontology] | | Ontology of biological information | Gene ontology: tool for the unification of biology. The Gene Ontology Consortium |
| Bio2RDF | | Converts biological info into RDF | Bio2RDF: Towards a mashup to build bioinformatics knowledge systems |
| Scenario | | Ontology of the BioSITES terms | |
| InterPro | | Database of protein sequences | InterProScan – an integration platform for the signature-recognition methods in InterPro |
| ProMed | | Collection of public health articles | ProMed-mail: An Early Warning System for Emerging Diseases |
| GeoNames | | Database of GPS locations around the world | Interlinking Open Data on the Web |
Technologies used in the SDDM process.
| Technology name | Download site | Description |
|---|---|---|
| ActiveMQ | | Used to connect each of the following BioSITES software components: Detectors, Sensors, Routers, and Controllers. |
| Rexster | | Rexster exposes RDF over REST. |
| JAXB | | Used to parse the contents of the XML metadata into Java objects; the content of these objects is then placed into a message, the broker is contacted, and the message is forwarded to downstream routers. |
| Neo4j | | Used to import the ProMED mail archive for easy searching. |
| OrientDB | | A NoSQL DBMS which can store 150,000 documents per second on common hardware. |
| DEX | | A high-performance |
| Jena’s TDB | | |
| Jena | | An API for the manipulation of RDF data. |
| Sesame | | An API for the manipulation of RDF data. |
| AllegroGraph | | A system to load, store and query RDF data. |
| Virtuoso | | Virtuoso is an innovative enterprise grade multi-model data server for agile enterprises & individuals. |
| Gremlin | | Gremlin is a graph traversal language. |
| Protege | | A free open-source Java tool providing an extensible architecture for the creation of customized knowledge-based applications. |
| Tomcat | | Apache Tomcat is an open source software implementation of the Java Servlet and JavaServer Pages technologies. |
| SPARQL | | A relational-like query interface to RDF data. |