Javier Garcia-Garcia, Emre Guney, Ramon Aragues, Joan Planas-Iglesias, Baldo Oliva.
Abstract
BACKGROUND: The analysis and usage of biological data is hindered by the spread of information across multiple repositories and the difficulties posed by different nomenclature systems and storage formats. In particular, there is an important need for data unification in the study and use of protein-protein interactions. Without good integration strategies, it is difficult to analyze the whole set of available data and its properties.
Year: 2010 PMID: 20105306 PMCID: PMC3098100 DOI: 10.1186/1471-2105-11-56
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparison of network analysis platforms.
| Feature | CY | GM | VA | OS | CD | AR | IN | GG | PI | PR | BL | PA | BI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Free for academic use | X | X | X | X | X | X | X | X | X | ||||
| Free for commercial use | X | X | X | X | X | X | X | ||||||
| Open source | X | X | X | X | X | ||||||||
| Curated pathway/network content | X | X | X | X | X | ||||||||
| Standard file format support | X | X | X | X | X | X | |||||||
| User-defined networks/pathways | X | X | X | X | X | X | X | X | X | X | X | X | |
| Functionality to infer new pathways | X | X | X | X | X | ||||||||
| GO/pathway enrichment analysis | X | X | X | X | X | ||||||||
| Automated graph layout | X | X | X | X | X | X | X | X | X | X | |||
| Complex criteria for visual properties | X | X | X | X | X | X | X | ||||||
| Multiple visual styles | X | X | X | X | X | X | |||||||
| Advanced node selection | X | X | X | X | X | X | X | X | X | X | |||
| Customizable gene/protein database | X | X | X | X | X | ||||||||
| Rich graphical annotation | X | X | X | X | X | ||||||||
| Statistical network analysis | X | X | X | X | X | X | |||||||
| Extensible functionality: plugins or API | X | X | X | X | X | X | X | ||||||
| Quantitative pathway simulation | X | X | |||||||||||
BIANA has been compared with the same programs, using the same set of features, as those presented in [21]. Compared software: CY, Cytoscape [55]; GM, GenMAPP [58]; VA, VisANT [59]; OS, Osprey [60]; CD, CellDesigner [61]; AR, Ariadne Genomics Pathway Studio [62]; IN, Ingenuity Pathways Analysis http://www.ingenuity.com; GG, GeneGO http://www.genego.com; PI, PIANA [13]; PR, ProViz [63]; BL, BioLayout [64]; PA, PATIKA [65]; BI, BIANA.
Comparison of biological information integration software.
| Feature | BI | PI | AP | AP2 | BN | UH | MI | ON | iRI | |
|---|---|---|---|---|---|---|---|---|---|---|
| Supports multiple biomolecule types (protein, gene, compound...) | X | X | X | X | ||||||
| Supports multiple relation types (interaction, complex, pathway...) | X | X | X | X | X | |||||
| Supports multiple data descriptor/identifiers types | X | X | X | X | X | X | X | X | ||
| X | ||||||||||
| X | (1) | |||||||||
| X | (1) | |||||||||
| Standalone Graphical Interface | X | X | ||||||||
| Scripting/Command line | X | X | X | X | ||||||
| Provides a webserver | X | X | X | X | X | X | ||||
| Provides a plugin for Cytoscape | X | X | X | X | ||||||
| Adds network analysis methods | X | X | X | X | ||||||
| Open Source | X | X | X | X | ||||||
| Does not require additional software | X | X | X | X | X | X | ||||
| Standalone application (runs locally) | X | X | X | X | ||||||
BIANA has been compared with other biological database integration software and webservers. Compared software: PI, PIANA [13]; AP, APID [15]; AP2, APID2NET [66]; BN, BNDB [14]; UH, UniHI [67]; MI, MIMI [68]; ON, ONDEX [18]; iRI, iRefIndex [19]. (1) According to the original manuscript, "The installation and use of the data integration methods is still command line driven and requires technical expertise to install, configure and use this component of the ONDEX system".
Figure 1. BIANA Architecture. BIANA is composed of four modules: Database Module, Parser Module, Network Module and Session Management Module. The Database Module handles communication between BIANA and the MySQL database. The Parser Module imports data into the BIANA database. The Network Module performs all network procedures using the NetworkX package. The Session Management Module handles biological data sets and networks. The BIANA Cytoscape Plugin is a separate interface that connects Cytoscape to BIANA through a socket. The BIANA framework can be run from the Python interpreter (or from command-line Python scripts) or from within Cytoscape through the plugin.
Figure 2. BIANA Data Model. A) BIANA Data Model Diagram. Schematic UML (Unified Modeling Language) representation of the data entries and their relationships in BIANA. Each element is explained in the text. B) BIANA Database Architecture.
Figure 3. BIANA Workflow. The BIANA working procedure involves at least three steps: 1) install the BIANA package and, if required, the Cytoscape plugin; 2) populate the BIANA database and create unification protocols; and 3) start a working session.
Default external database parsers provided by BIANA.
| External Database | Details |
|---|---|
| Uniprot [ | Protein sequence, identifiers and functional information (domain composition, description, function...). Both Swiss-prot (manually curated) and TrEMBL (automatically annotated) can be inserted into BIANA. |
| GenPept from GenBank [ | Protein sequences translated from the GenBank database. GenBank is the NIH genetic sequence database, a collection of all publicly available DNA. |
| Non-redundant Blast Database (FASTA formatted file) (August 2008) | BLAST Non-redundant database from NCBI. Non-redundant protein sequence database with entries from GenPept, SwissProt, PIR, PRF, PDB and NCBI RefSeq. |
| International Protein Index (IPI) [ | Integrated database for proteomics experiments. |
| HUGO Gene Nomenclature Committee (HGNC) (September 2008) | Approved unique gene symbols for each human gene. |
| Cluster of Orthologous Genes (COGs) [ | Collection of orthologous protein sets for prokaryotes and eukaryotes. |
| Gene Ontology (GO) [ | The Gene Ontology provides a controlled vocabulary to describe gene and gene product attributes in any organism. It allows to link in BIANA between |
| PSI-MI obo | Controlled vocabulary and ontology for molecular interactions and their detection methods. Provides the information about and the relation between |
| NCBI Taxonomy [ | The NCBI taxonomy database contains the names of all organisms that are represented in the genetic databases. It allows to link between |
| Structural Classification of Proteins (SCOP) [ | Manually curated database with a comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. It has a hierarchical classification of the structural domains. |
| PSI-MI 2.5 Format [ | Data exchange format for molecular interactions. The following protein-protein interaction databases can be inserted into BIANA: IntAct [ |
| Biopax Level 2 Format | Data exchange format for biological pathway data. The following databases can be inserted into BIANA: Reactome [ |
| iRefIndex [ | A consolidated protein interaction database with provenance. (April 2009) |
| Kyoto Encyclopedia of Genes and Genomes (KEGG) [ | Kegg Ligand (chemical compounds, drugs, glycans and reactions), Kegg genes (genomes, genes and proteins) and Kegg orthology (ortholog annotation) are inserted into BIANA. |
| STRING [ | Database of known and predicted protein interactions. Includes direct (physical) and indirect (functional) associations. |
BIANA provides parsers for the common public biological databases listed above. Updated database parsers can be uploaded to the project webpage http://sbi.imim.es/web/BIANA.php.
Figure 4. BIANA networks created from user-provided datasets. The BIANA Cytoscape plugin has been used to generate the relation networks of three different user-specific datasets (data is available at http://sbi.imim.es/web/BIANA.php). Represented entities are: proteins (green nodes), genes (blue nodes), interactions and metabolic relations (blue edges) and cooperation (red edges). A) Metabolic network reconstruction where a relation is established and scored between each pair of possible chained enzymatic reactions. A chained reaction between enzyme A and enzyme B is possible when there is at least one chemical compound in the intersection, acting at the same time as product of enzyme A and substrate of enzyme B. The network has been filtered to keep relations with a score greater than 1.2 (the score is based on the plausibility of observing chemical compounds in the intersection, according to their own frequency and the frequency of other products of enzyme A and other substrates of enzyme B that do not take part in chaining reactions). B) Protein-protein interaction network predicted from sequence/structure distant patterns as described in Espadaler et al. [22]. Only human proteins and interactions from set I3 are shown (see the original work for a detailed explanation). C) Network representation of cooperative transcription factors and their regulated genes described in Aguilar et al. [23]. Only transcription factors cooperating with others are represented.
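The chaining rule in panel A can be illustrated with a minimal sketch. The enzyme names and compounds below are hypothetical toy data, and the sketch only detects which ordered enzyme pairs can be chained; it does not reproduce the frequency-based score used for the 1.2 cut-off, whose formula is not given here.

```python
from itertools import permutations

# Hypothetical toy data: enzyme -> (set of substrates, set of products).
reactions = {
    "enzA": ({"glucose"}, {"g6p"}),
    "enzB": ({"g6p"}, {"f6p"}),
    "enzC": ({"f6p", "atp"}, {"f16bp", "adp"}),
}


def chained_reactions(reactions):
    """Return {(a, b): shared compounds} for every ordered enzyme pair
    where at least one product of a is also a substrate of b."""
    edges = {}
    for a, b in permutations(reactions, 2):
        shared = reactions[a][1] & reactions[b][0]  # products(a) ∩ substrates(b)
        if shared:
            edges[(a, b)] = shared
    return edges


edges = chained_reactions(reactions)
for (a, b), shared in sorted(edges.items()):
    print(a, "->", b, "via", sorted(shared))
```

The relation is directed: enzA can be chained to enzB without the reverse being true, which is why ordered pairs are enumerated.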
Recommended unification protocols.
| External Databases | Attributes (identifiers) |
|---|---|
| Uniprot, GenBank, IPI, KeggGene, COG, String | ProteinSequence AND taxID |
| Uniprot, HGNC, HPRD, DIP, MPACT, Reactome, IPI, BioGrid, MINT, IntAct, String | UniprotAccession |
| Uniprot, String | UniprotEntry |
| Uniprot, HGNC, HPRD, DIP, String | GeneID |
| Uniprot, SCOP(promiscuous) | PDB |
List of external databases and the attributes (identifiers) proposed to be used in a unification protocol.
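As a rough illustration, the table above can be written down as plain data: each rule pairs a set of external databases with the attributes that must all match. The variable and function names are hypothetical and this is not BIANA's own configuration format; it is only one way to query which rules link two databases.

```python
# Each rule: (databases the rule applies to, attributes that must ALL match).
# Hypothetical representation of the table; not BIANA's actual API.
RECOMMENDED_PROTOCOL = [
    ({"Uniprot", "GenBank", "IPI", "KeggGene", "COG", "String"},
     ("ProteinSequence", "taxID")),
    ({"Uniprot", "HGNC", "HPRD", "DIP", "MPACT", "Reactome", "IPI",
      "BioGrid", "MINT", "IntAct", "String"},
     ("UniprotAccession",)),
    ({"Uniprot", "String"}, ("UniprotEntry",)),
    ({"Uniprot", "HGNC", "HPRD", "DIP", "String"}, ("GeneID",)),
    ({"Uniprot", "SCOP"}, ("PDB",)),
]

# Databases flagged as promiscuous in the table.
PROMISCUOUS = {"SCOP"}


def shared_attributes(db_a, db_b, protocol=RECOMMENDED_PROTOCOL):
    """List the attribute combinations on which entries of db_a and db_b
    can be unified under the given protocol."""
    return [attrs for dbs, attrs in protocol if db_a in dbs and db_b in dbs]


print(shared_attributes("Uniprot", "HGNC"))
print(shared_attributes("HGNC", "SCOP"))
```

Under these rules, Uniprot and HGNC entries can be linked through either UniprotAccession or GeneID, while HGNC and SCOP entries share no rule and are only connected indirectly through Uniprot.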
Figure 5. BIANA Unification. Example where three different unification protocols are applied to three external databases (each external database is represented with a different color). BIANA network nodes are individual user entities. Each user entity consists of a set of equivalent external entities. Each external entity can belong to a single user entity, unless its database is defined as promiscuous, in which case a single external entity can belong to multiple user entities. External entities in promiscuous databases cannot form a user entity by themselves. In this theoretical example, which shows the importance of the unification protocol, Prot1A is merged with Prot1B when unifying by UniprotAccession identifier, while they are not merged if unification is done by Sequence and taxonomyID. However, when unified by UniprotAccession or geneSymbol, Prot1A, Prot1B, Prot2A and Prot2B are merged.
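The grouping behaviour described in this caption can be sketched with a union-find structure. This is a minimal sketch under stated assumptions, not BIANA's implementation: the function `unify`, the database names `DB_A`, `DB_B`, `DB_P`, and all identifiers in the toy data are hypothetical, and every attribute is modelled as a tuple of values.

```python
from collections import defaultdict
from itertools import product


def unify(entities, protocol, promiscuous=frozenset()):
    """Group external entities into user entities.

    entities: {entity_id: (database, {attribute: tuple_of_values})}
    protocol: list of (set_of_databases, tuple_of_attribute_names)
    Returns a list of sets of entity ids, one set per user entity.
    """
    # Only non-promiscuous entities can found a user entity.
    parent = {e: e for e, (db, _) in entities.items() if db not in promiscuous}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Bucket entities by (rule index, combination of attribute values).
    buckets = defaultdict(list)
    for eid, (db, attrs) in entities.items():
        for i, (dbs, names) in enumerate(protocol):
            if db in dbs and all(n in attrs for n in names):
                for combo in product(*(attrs[n] for n in names)):
                    buckets[(i, combo)].append(eid)

    # Merge the non-promiscuous members of every bucket.
    for members in buckets.values():
        regular = [e for e in members if e in parent]
        for other in regular[1:]:
            parent[find(other)] = find(regular[0])

    groups = defaultdict(set)
    for e in parent:
        groups[find(e)].add(e)

    # Promiscuous entities join every group they match without merging groups.
    for members in buckets.values():
        regular = [e for e in members if e in parent]
        for e in members:
            if e not in parent:
                for r in regular:
                    groups[find(r)].add(e)
    return sorted(groups.values(), key=sorted)


# Hypothetical toy data loosely following the caption.
entities = {
    "Prot1A": ("DB_A", {"UniprotAccession": ("P00001",), "ProteinSequence": ("MKV",),
                        "taxID": (9606,), "PDB": ("1abc",)}),
    "Prot1B": ("DB_B", {"UniprotAccession": ("P00001",), "ProteinSequence": ("MKVL",),
                        "taxID": (9606,)}),
    "Prot2A": ("DB_A", {"UniprotAccession": ("P00002",), "ProteinSequence": ("MAA",),
                        "taxID": (9606,), "PDB": ("2xyz",)}),
    "Dom1": ("DB_P", {"PDB": ("1abc", "2xyz")}),
}
regular_dbs = {"DB_A", "DB_B"}
by_accession = unify(entities, [(regular_dbs, ("UniprotAccession",))], {"DB_P"})
by_sequence = unify(entities, [(regular_dbs, ("ProteinSequence", "taxID"))], {"DB_P"})
with_domains = unify(
    entities,
    [(regular_dbs, ("UniprotAccession",)), ({"DB_A", "DB_P"}, ("PDB",))],
    {"DB_P"},
)
```

In this toy data, unifying by accession merges Prot1A with Prot1B, while unifying by sequence and taxonomy keeps them apart. The promiscuous entry Dom1 is attached to two user entities in `with_domains` without causing them to merge, and it forms no user entity of its own when no rule matches it.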
Comparison of three different networks at level 1.
| Disease | Keywords | Initial Set | PPI | PPI + inferred interactions |
|---|---|---|---|---|
| Cancer | Cancer, tumor, metastasis | 985 (93) | 2782 (251) | 6272 (489) |
| Diabetes | Diabetes | 86 (10) | 284 (19) | 2121 (54) |
| Alzheimer | Alzheimer | 30 (4) | 138 (6) | 1098 (12) |
Comparison of three different networks at level 1 using reported protein-protein interactions vs. interactions inferred by sequence homology. A BIANA database has been created using the following databases: Uniprot Swissprot, IntAct, MINT, BioGrid, DIP and HPRD. Three initial data sets, each related to a different pathology, have been created by a keyword search in the fields Disease, Keyword, Description and Function. Two networks at level 1 have been created for each set: 1) using protein-protein interactions reported by third-party databases and 2) using interactions inferred by sequence similarity (see text for details). For each network we calculated the number of proteins involved in the pathologies according to HEFalMp [75] with p < 0.00001 (shown in parentheses). Using inferred interactions retrieves a higher number of candidates.
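The effect summarized in this table can be illustrated with a small sketch. Two assumptions are made here that the caption does not confirm: a level-1 network is taken to be the seed proteins plus every interaction with at least one seed endpoint, and inferred interactions are produced by a generic interolog-style transfer between sequence-similar proteins. All protein names and the `similar` mapping are hypothetical.

```python
from itertools import product


def level1_network(seeds, interactions):
    """Return (nodes, edges) of the level-1 network around the seeds.
    Assumption: an edge is kept when at least one endpoint is a seed."""
    seeds = set(seeds)
    edges = {frozenset(pair) for pair in interactions if seeds & set(pair)}
    nodes = seeds.union(*edges)
    return nodes, edges


def transfer_by_similarity(interactions, similar):
    """Interolog-style transfer (an assumption, not the paper's exact rule):
    if a-b is reported, also propose a2-b2 for every a2 similar to a
    and every b2 similar to b."""
    inferred = set()
    for a, b in interactions:
        candidates_a = {a} | similar.get(a, set())
        candidates_b = {b} | similar.get(b, set())
        for a2, b2 in product(candidates_a, candidates_b):
            inferred.add(frozenset((a2, b2)))
    return inferred


# Hypothetical toy data.
reported = [("P1", "P2"), ("P3", "P4")]
similar = {"P3": {"P1"}}  # P1 is sequence-similar to P3
seeds = {"P1"}

nodes_ppi, _ = level1_network(seeds, reported)
nodes_inferred, _ = level1_network(seeds, transfer_by_similarity(reported, similar))
print(sorted(nodes_ppi), sorted(nodes_inferred))
```

Because the inferred edge set contains every reported edge, the level-1 network built from it can only grow, which matches the direction of the counts in the table.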