Dieter Maier, Wenzel Kalus, Martin Wolff, Susana G Kalko, Josep Roca, Igor Marin de Mas, Nil Turan, Marta Cascante, Francesco Falciani, Miguel Hernandez, Jordi Villà-Freixa, Sascha Losko.
Abstract
BACKGROUND: To enhance our understanding of complex biological systems such as diseases, we need to put all of the available data into context and use it to detect relations, patterns and rules that allow predictive hypotheses to be defined. Life science has become a data-rich science with information about the behaviour of millions of entities such as genes, chemical compounds, diseases, cell types and organs, which are organised in many different databases and/or spread throughout the literature. Existing knowledge such as genotype-phenotype relations or signal transduction pathways must be semantically integrated and dynamically organised into structured networks that are connected with clinical and experimental data. Different approaches to this challenge exist, but so far none has proven entirely satisfactory.
Year: 2011 PMID: 21375767 PMCID: PMC3060864 DOI: 10.1186/1752-0509-5-38
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Figure 1. BioBridge specific BioXM data model. Partial visualisation of the BioBridge BioXM data model. A) Only the central objects and relations of the data model, such as "Gene", "Protein", "Expression Experiment" or "COPD pathway", are displayed. B) Data model configuration using the data model viewer context menu. New semantic objects are generated and edited directly within the graph viewer.
Figure 2. Querying and visualising the knowledge network. Visualisation of the knowledge network retrieved by querying for all genes that simultaneously show a fold change < -1.0 or > 1.0 in a COPD expression experiment set, are associated with functional information inferred by GO "signal transduction", and are connected to each other using the "shortest path" network search. A) Provides an overview of the full retrieved network. B) Focuses on a specific part of the interaction network and the context menu used to add gene-disease and gene-compound information for the MAPK14 gene.
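The query in Figure 2 combines an attribute filter (fold change) with a shortest-path search over the network. The following is a minimal sketch of that combination on a toy in-memory graph, not BioXM's query engine, which the source does not describe. All gene names other than MAPK14, all fold-change values and all edges are invented for illustration, and breadth-first search is assumed as the shortest-path method.

```python
from collections import deque

# Hypothetical fold changes from an expression experiment set.
fold_change = {"MAPK14": 1.8, "GENE_A": -1.4, "GENE_B": 0.3, "GENE_C": 2.1}

# Hypothetical undirected interaction edges.
edges = [
    ("MAPK14", "GENE_B"),
    ("GENE_B", "GENE_A"),
    ("GENE_A", "GENE_C"),
    ("MAPK14", "GENE_D"),
    ("GENE_D", "GENE_C"),
]


def build_adjacency(edge_list):
    """Turn an edge list into an undirected adjacency map."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def regulated_genes(values, threshold=1.0):
    """Genes whose fold change is below -threshold or above +threshold."""
    return sorted(g for g, fc in values.items() if fc < -threshold or fc > threshold)


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour in parents:
                continue
            parents[neighbour] = node
            if neighbour == goal:
                # Walk back from the goal to the start via the parent links.
                path = [neighbour]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


adjacency = build_adjacency(edges)
hits = regulated_genes(fold_change)
path = shortest_path(adjacency, "MAPK14", "GENE_A")
```

Here `hits` holds the genes passing the fold-change filter and `path` is one shortest chain of interactions linking two of them; connecting every pair of hits this way yields the kind of network shown in panel A.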
Figure 3. Natural language like query wizard. Formulating a natural language query and following the query building steps in the wizard provides a simple way to generate complex queries, which can be saved as a smart folder for re-use and for other users. Query: "Find all Patients which simultaneously have anthropometric attributes Age >30, Sex male and BMI-BeforeTraining >= 10, are diagnosed with Smoking status = 1 and Packs per >= 10 and whose Physiology has been measured with Walk-AfterTraining >= 120 and Walk-BeforeTraining <= 100."
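The example query in Figure 3 is a conjunction of attribute conditions. As a rough sketch of how such a query could be represented outside the wizard, each condition can be stored as an (attribute, operator, value) triple and applied to patient records. The attribute names are copied from the caption as printed (including "Packs per"); the patient records and the flat dictionary layout are assumptions made for illustration.

```python
import operator

# One (attribute, comparison, value) triple per condition in the query.
CONDITIONS = [
    ("Age", operator.gt, 30),
    ("Sex", operator.eq, "male"),
    ("BMI-BeforeTraining", operator.ge, 10),
    ("Smoking status", operator.eq, 1),
    ("Packs per", operator.ge, 10),
    ("Walk-AfterTraining", operator.ge, 120),
    ("Walk-BeforeTraining", operator.le, 100),
]


def matches(patient, conditions=CONDITIONS):
    """True only if the patient has every attribute and satisfies every condition."""
    return all(
        attribute in patient and compare(patient[attribute], value)
        for attribute, compare, value in conditions
    )


# Hypothetical patient records.
patients = [
    {"id": "P1", "Age": 54, "Sex": "male", "BMI-BeforeTraining": 24,
     "Smoking status": 1, "Packs per": 20,
     "Walk-AfterTraining": 130, "Walk-BeforeTraining": 95},
    {"id": "P2", "Age": 61, "Sex": "female", "BMI-BeforeTraining": 27,
     "Smoking status": 1, "Packs per": 15,
     "Walk-AfterTraining": 125, "Walk-BeforeTraining": 90},
    {"id": "P3", "Age": 47, "Sex": "male", "BMI-BeforeTraining": 22,
     "Smoking status": 1, "Packs per": 12,
     "Walk-AfterTraining": 110, "Walk-BeforeTraining": 98},
]

selected = [p["id"] for p in patients if matches(p)]
```

Keeping the conditions as data rather than hard-coded logic mirrors the idea of saving a query for later re-use.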
Figure 4. BioXM architectural concept. A) The BioXM Java client-server application implements a modular architectural concept on the server side. Tasks such as resource and user management, reporting or graph layout are implemented as individual modules. B) Inter-module communication and interaction with external applications, web services, browser and Java clients is based on standard protocols.
Fundamental semantic objects
| Semantic object | Description | Example |
|---|---|---|
| Element | Represents a basic unit of a knowledge model | "Gene" element type can be used to create the "STAT3" gene element; "Disease term" element type can be used to create the "pancreatic tumor" disease term element |
| Relation | Describes a relationship between semantic objects | "Gene-disease" relation class can be used to create the "STAT3 is associated with disease pancreatic tumor" relation |
| Annotation | Extends the properties of | Gene report |
| Experiment | Performance optimized extension of an element by a set of attributes | Expression data |
| Ontology | Classifies semantic objects according to a defined hierarchical nomenclature of concepts | "3.2.2.21 DNA-3-methyladenine glycosidase II" entry is part of the "EC numbers" ontology |
| Context | Represents sets of semantic objects | Metabolic pathways; Protein complexes; A disease process or pattern |
| Database/external object | A basic unit of a knowledge model populated from an external application/database | dbSNP Sequence Variant Genome feature |
The fundamental semantic objects used in BioXM which allow a descriptive model of the world to be formulated and data resources to be related to that model. The semantic objects allow an extendable model to be defined from a set of well-defined building blocks (adapted from [30]).
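To make the building blocks in the table more concrete, here is a minimal sketch of three of them (Element, Relation, Context) as plain Python data classes, using the STAT3 example from the table. The class and field names are illustrative assumptions and do not reflect BioXM's internal (Java) implementation, which is not shown in the source.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Element:
    """A basic unit of the knowledge model, e.g. a gene or a disease term."""
    element_type: str
    name: str


@dataclass(frozen=True)
class Relation:
    """A typed relationship between two semantic objects."""
    relation_class: str
    source: Element
    target: Element


@dataclass
class Context:
    """A named set of semantic objects, e.g. a pathway or disease process."""
    name: str
    members: set = field(default_factory=set)

    def add(self, semantic_object):
        self.members.add(semantic_object)


# The example from the table: STAT3 is associated with pancreatic tumor.
stat3 = Element("Gene", "STAT3")
tumor = Element("Disease term", "pancreatic tumor")
association = Relation("Gene-disease", stat3, tumor)

# Hypothetical context grouping the objects above.
context = Context("example disease process")
for semantic_object in (stat3, tumor, association):
    context.add(semantic_object)
```

Freezing `Element` and `Relation` makes them hashable, so the same object can belong to several contexts without being duplicated.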
COPD specific knowledge base
| Source database | Information type | Current statistics | Level of curation | Updates/Version |
|---|---|---|---|---|
| | Protein interaction | 6 256 interactions | High throughput data submission and manually curated from literature | last public version 20.3.07 |
| | Protein interaction | 19 707 interactions | Manually curated from literature | updated monthly |
| | Enzyme kinetics | 4 729 | Manually curated from literature | updated monthly |
| | Compound information | 15 367 | Curated from different data sources | updated weekly |
| | Compound-gene, Compound-disease and Gene-disease relationships | 259 898 relations | Manually curated from literature | updated monthly |
| | Gene functional information | 80 793 human, mouse and rat genes | Curated information integrated from different databases, based on RefSeq genomes | updated weekly |
| | Enzyme related functional information | 4 833 | Manually curated from literature | updated weekly |
| | Functional genomics | >400 000 individual experiments | User submission | updated weekly |
| | Protein interaction | 21 584 binary interactions | Literature curation | updated weekly |
| | Pathways | 418 pathways | Manually curated from literature | updated monthly |
| | Compound information | 15 185 | Manually curated from the published literature | updated monthly |
| | Protein interaction | 410 interactions | Manually curated from literature | current release 31.10.07 |
| | Gene-disease relations | 20 823 | Curated from literature | updated weekly |
| | Protein family information | 10 340 families | Manually curated from sequence alignments | 23.0 |
| | Interaction | >1.4 million | Automatic inference | last release 18.10.04 |
| | Compound information | >26 million | Automatic collection | updated weekly |
| | Literature abstracts | >19 million | Automatic collection with manual curation | updated weekly |
| | Interactions and pathways | >600 pathways, >24 000 | Manual curation | updated monthly |
| | DNA and protein sequences | >11 million | Automatic processing and manual curation | updated weekly |
| | Transcript sequences | >2 million | User submission followed by automatic clustering | updated monthly |
| | Protein sequences | >10 million | Automatic processing, Swissprot subsection manual curation | updated bi-weekly |
Public data integrated into BioXM in the current version of the BioBridge COPD knowledge base.
Figure 5. Populating the data model. Based on the given data model, the import wizard provides the selection of available import operations in the left frame. These are moved by drag-and-drop into the right frame where they form the import script, which provides the mapping information between a data source and the data model. Here two elements from the data source are defined as type "Protein" and are referenced by their UniProt IDs. The relation between the two proteins is a "Protein interaction" from the Reactome data source and the associated evidence is stored as annotation.
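The mapping that Figure 5 describes (two columns become "Protein" elements keyed by ID, the row becomes a "Protein interaction" relation, and the evidence becomes an annotation) can be sketched as a small import routine. This is an illustration under stated assumptions, not the BioXM import script format: the CSV layout, column names, placeholder IDs and evidence strings are all hypothetical.

```python
import csv
import io

# Hypothetical source file; the IDs are placeholders, not real UniProt accessions.
SOURCE = """protein_a,protein_b,evidence
ID_A,ID_B,example evidence 1
ID_B,ID_C,example evidence 2
"""


def import_interactions(text, source_name="Reactome"):
    """Map each row to two Protein elements plus one annotated relation."""
    elements = {}
    relations = []
    for row in csv.DictReader(io.StringIO(text)):
        # Create each protein element once, keyed by its identifier.
        for column in ("protein_a", "protein_b"):
            identifier = row[column]
            elements.setdefault(identifier, {"type": "Protein", "id": identifier})
        # The row itself becomes a relation carrying the evidence as annotation.
        relations.append({
            "class": "Protein interaction",
            "source": source_name,
            "members": (row["protein_a"], row["protein_b"]),
            "annotation": {"evidence": row["evidence"]},
        })
    return elements, relations


elements, relations = import_interactions(SOURCE)
```

Keying elements by identifier means a protein that appears in several rows is created only once and simply gains further relations.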
Figure 6. BioXM graphical user interface overview. The BioXM graphical user interface (GUI) consists of three frames. A navigation bar provides the functions for importing, managing, reporting and searching data. A project and repositories frame on the left allows all data available to a user to be accessed in the repositories section and the data to be organised in a user and project specific way in the projects section. A right frame is used to display detailed information about any object selected in the left frame.
Figure 7. Report with integrated external application result. This result view shows the first three pathways that differ significantly between sedentary and trained healthy people. The 3D scatterplot at the top visualises the PC1 for each experiment as a spot in 3D space with the KEGG pathways as dimensions. Green (experiment group 1, here pre-training) and red (experiment group 2, here post-training) spots clearly occupy two different regions of the plot, indicating differences. The significance of the differences is shown in the tabular report, where the first column provides the name of the pathway. Columns 2 and 3 list the PC1 values for each of the associated experiments in groups 1 and 2. Columns 4 and 5 show the overall PC1 mean of the pre- and post-training data. The following columns list the t-, p- and adjusted p-values, respectively.
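The report in Figure 7 rests on two steps per pathway: reduce each experiment's expression values to a PC1 score, then compare the scores of the two groups with a t statistic. The sketch below illustrates those two steps for one pathway on invented data. How the external application actually computes PC1 and which t-test variant it uses are not stated in the source; power iteration and Welch's t statistic are assumptions here, and the p-value and its adjustment are omitted.

```python
from math import sqrt
from statistics import mean, variance


def pc1_scores(rows, iterations=200):
    """Project each row (one experiment) onto the first principal component."""
    n_rows, n_cols = len(rows), len(rows[0])
    col_means = [mean(row[j] for row in rows) for j in range(n_cols)]
    centered = [[row[j] - col_means[j] for j in range(n_cols)] for row in rows]
    # Sample covariance matrix of the columns (genes).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n_rows - 1) for j in range(n_cols)]
        for i in range(n_cols)
    ]
    # Power iteration converges to the leading eigenvector of the covariance.
    vector = [1.0] * n_cols
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = sqrt(sum(x * x for x in product))
        if norm == 0:
            break
        vector = [x / norm for x in product]
    return [sum(c[j] * vector[j] for j in range(n_cols)) for c in centered]


def welch_t(group_1, group_2):
    """Welch's t statistic for two independent samples."""
    standard_error = sqrt(
        variance(group_1) / len(group_1) + variance(group_2) / len(group_2)
    )
    return (mean(group_1) - mean(group_2)) / standard_error


# Hypothetical expression values: rows are experiments, columns are pathway genes.
pre_training = [[1.0, 2.0, 1.0], [1.2, 2.1, 0.9], [0.9, 1.9, 1.1]]
post_training = [[2.0, 3.0, 1.0], [2.1, 3.2, 1.1], [1.9, 2.9, 0.9]]

scores = pc1_scores(pre_training + post_training)
pre_scores, post_scores = scores[:3], scores[3:]
t_value = welch_t(pre_scores, post_scores)
```

The sign of a principal component is arbitrary, so only the magnitude of `t_value` is meaningful when comparing the two groups.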
Figure 8. Network connecting inflammation with central metabolism. A) Individual nodes (compounds and proteins) are connected based on the shortest path algorithm. The initial compounds and proteins (marked in yellow) have been selected based on their involvement in pathways detected by PCA-pathway analysis and their significantly different concentration/activity in the COPD or training specific literature. Proteins and compounds are connected by following all possible relations within the COPD knowledge network, which results in the most parsimonious network revealing a putative mechanistic connection between inflammation related processes and the central metabolism. B) Visualisation of a disease specific pathway association as a proxy for possible protein importance in a disease mechanism. Each protein in the network is queried for the number of associated disease specific pathways manually curated from the literature into the knowledge base. Numbers are visualised from light red (few pathways) to deep red (high number of pathways associated).
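The colour coding in panel B maps a pathway count to a shade between light red and deep red. A minimal sketch of such a mapping follows, assuming linear interpolation between two RGB values; the specific colours, the linear scale and the protein counts are illustrative choices, not taken from the source.

```python
def count_to_colour(count, max_count, light=(255, 200, 200), deep=(139, 0, 0)):
    """Linearly interpolate from a light red (few) to a deep red (many)."""
    if max_count <= 0:
        fraction = 0.0
    else:
        # Clamp so out-of-range counts still give a valid colour.
        fraction = min(max(count / max_count, 0.0), 1.0)
    channels = [round(lo + (hi - lo) * fraction) for lo, hi in zip(light, deep)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


# Hypothetical pathway counts per protein.
pathway_counts = {"PROTEIN_X": 0, "PROTEIN_Y": 3, "PROTEIN_Z": 6}
highest = max(pathway_counts.values())
colours = {name: count_to_colour(n, highest) for name, n in pathway_counts.items()}
```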