Liam G Fearnley, Melissa J Davis, Mark A Ragan, Lars K Nielsen.
Abstract
Large quantities of information describing the mechanisms of biological pathways continue to be collected in publicly available databases. At the same time, experiments have increased in scale, and biologists increasingly use pathways defined in online databases to interpret the results of experiments and generate hypotheses. Emerging computational techniques that exploit the rich biological information captured in reaction systems require formal standardized descriptions of pathways to extract these reaction networks and avoid the alternative: time-consuming and largely manual literature-based network reconstruction. Here, we systematically evaluate the effects of commonly used knowledge representations on the seemingly simple task of extracting a reaction network describing signal transduction from a pathway database. We show that this process is in fact surprisingly difficult, and that the pathway representations adopted by various knowledge bases have dramatic consequences for reaction network extraction, connectivity, capture of pathway crosstalk and the modelling of cell-cell interactions. Researchers constructing computational models built from automatically extracted reaction networks must therefore consider the issues we outline in this review to maximize the value of existing pathway knowledge.
Keywords: databases; modelling; reaction networks; signal transduction
Year: 2013 PMID: 23946492 PMCID: PMC4239801 DOI: 10.1093/bib/bbt058
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Formats implemented by major databases
| Database | BioPAX L3 | BioPAX L2 | SBML | PSI-MITAB | Custom format | API |
|---|---|---|---|---|---|---|
| Reactome | ✓ | ✓ | ✓ | ✓ | MySQL dump | ✓ |
| PANTHER Pathway | ✓ | ✗ | ✓ | ✓ | CellDesigner-compatible SBML | ✗ |
| KEGG | ✗ | ✗ | ✗ | ✗ | KGML | ✓ |
| NCI-PID | ✓ | ✓ | ✗ | ✗ | PID XML | ✗ |
| Pathway Commons | ✓ | ✗ | ✗ | ✗ | SIF, tab-delimited | ✓ |
We discuss data sourced from four major databases (Reactome, PANTHER Pathways, KEGG, NCI-PID) and a meta-database that aggregates information from multiple sources (Pathway Commons). These databases make their data available through a graphical web-based interface (with associated diagrammatic representations), in numerous community-specified and custom formats, and through APIs.
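As a rough illustration of what working with one of the simpler export formats involves, the sketch below reads tab-delimited SIF-style lines into an adjacency map. It assumes the common three-column layout (source, interaction type, target); the function name `parse_sif`, the entity names and the interaction labels in the example are illustrative assumptions, not data taken from any of the databases above.

```python
from collections import defaultdict


def parse_sif(text):
    """Parse SIF-style lines (source <TAB> interaction <TAB> target).

    Returns a dict mapping each source to a set of (interaction, target)
    pairs. Lines with fewer than three columns are skipped in this sketch.
    """
    adjacency = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # isolated node or malformed line
        source, interaction, target = fields[0], fields[1], fields[2]
        adjacency[source].add((interaction, target))
    return dict(adjacency)


# Hypothetical two-line example, not real database content.
example = "A\tcontrols-state-change-of\tB\nB\tin-complex-with\tC\n"
network = parse_sif(example)
```

A binary format like this discards reaction structure (participants, stoichiometry, location), which is part of why the richer formats in the table matter for reaction network extraction.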
Figure 1: Duplication of entities decreases network connectivity. It is essential that each entity in a given cellular location is represented with a single entry in the underlying database. In this visualization, each node in the network refers to a unique database entry. In (A), the entity represented by a star has been duplicated (solid and dashed outlines). This significantly reduces the connectivity and complexity of the network described by the data. (B) shows a network consisting of multiple signal transduction pathways implicated in prostate cancer, visualized from data originally sourced from the PANTHER Pathways database and analysed in [17]. This network has duplication of 28 entities. Correcting these duplications, as illustrated in (C), yields the network shown in (D), with an attendant increase in connectivity and complexity.
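The effect described in Figure 1 can be sketched on a toy network. The example below is a minimal illustration, assuming an undirected edge list and a hand-built mapping from duplicate entries to one canonical entry; the node names and the helper functions `count_components` and `merge_duplicates` are hypothetical and do not reproduce the procedure used for the PANTHER data.

```python
def count_components(edges):
    """Count connected components of an undirected graph with union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return len({find(node) for node in list(parent)})


def merge_duplicates(edges, canonical):
    """Rewrite edge endpoints so duplicated entries share one canonical node."""
    return [(canonical.get(a, a), canonical.get(b, b)) for a, b in edges]


# Hypothetical network: 'star_1' and 'star_2' are duplicate entries
# for the same entity in the same cellular location.
edges = [("receptor", "star_1"), ("star_2", "kinase"), ("kinase", "tf")]
canonical = {"star_1": "star", "star_2": "star"}

before = count_components(edges)
after = count_components(merge_duplicates(edges, canonical))
```

In this toy case the duplicated entry splits the network into two components, and merging the duplicates joins them into one.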
Frequency of occurrence of meta-entities and non-flat complexes
| Database | Total number of entities | Meta-entities | Number of complexes | Number of recursive complexes |
|---|---|---|---|---|
| Reactome | 24 477 | 2419 | 6040 | 3485 |
| KEGG* | 25 043 | 4716 | – | – |
| PANTHER | 13 241 | Unlabelled¹ | 913 | 34 |
| NCI-PID | 27 367 | Approx. 960² | 9016 | 2751 |
Both meta-entities and non-flat complexes are common in the major signal transduction databases discussed. *KEGG data sourced from their REST API, downloaded as KGML files for interpretation. ¹PANTHER Pathways does not annotate its sets of entities; this is only evident from entity names [e.g. ‘C-jun-amino-terminal kinase 1, 2, and 3 (JNK1–3)’ as specified in the ‘FGF signalling pathway’]. ²NCI-PID does not annotate its sets of entities with a clear marker; an estimate was made by counting the number of entities with cross-references to multiple proteins (e.g. ‘GRP1 family’ from the ‘Arf6 signaling events pathway’).
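To make the notion of a non-flat complex concrete, the sketch below assumes that a 'recursive' complex is one with at least one component that is itself a complex, and flattens such a complex to its non-complex members. The dictionary layout, the complex names and both helper functions are hypothetical; the counts in the table were not necessarily produced this way.

```python
def is_recursive(name, components):
    """True if any direct component of the complex is itself a complex."""
    return any(part in components for part in components[name])


def flatten_complex(name, components, _seen=frozenset()):
    """Return the set of non-complex members, recursing into nested complexes."""
    if name in _seen:
        raise ValueError(f"cyclic complex definition at {name!r}")
    seen = _seen | {name}
    members = set()
    for part in components[name]:
        if part in components:
            members |= flatten_complex(part, components, seen)
        else:
            members.add(part)
    return members


# Hypothetical data: complex_ABC contains complex_AB as a component.
components = {
    "complex_AB": ["A", "B"],
    "complex_ABC": ["complex_AB", "C"],
}
flat = flatten_complex("complex_ABC", components)
recursive = [name for name in components if is_recursive(name, components)]
```

Flattening to a set discards stoichiometry and assembly order, so it suits connectivity analysis better than quantitative modelling.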
Figure 2: Bucketing of entities has a significant effect on networks. In (A), three entities have been grouped into a meta-entity (dashed circle), which interacts with the species described by the star. One of the entities has a number of distinct separate activities outside of this group. The network depicted in (B) is sourced from Reactome's ‘Mitotic G1-G1/S phases’ pathway (REACT_21267.3). The BioPAX Level 3 representation of this pathway contains 27 of these meta-entities. Removing the meta-entity, as illustrated in (C), results in significant changes to the network shape. Restoration of connectivity lost owing to meta-entity use generates the network shown in (D), significantly changing network topology.
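One simple way to restore connectivity lost to a meta-entity is to replace each edge that touches the meta-entity with one edge per member. The sketch below shows this on a toy edge list; the names `JNK_set` and `substrate_x`, and the function `expand_meta_entities`, are hypothetical, and the expansion assumes every member takes part in every interaction of the group, which may overstate what is actually known.

```python
def expand_meta_entities(edges, members):
    """Replace meta-entity endpoints with their individual members.

    edges   -- iterable of (source, target) pairs
    members -- dict mapping a meta-entity name to a list of member names
    """
    expanded = set()
    for a, b in edges:
        for source in members.get(a, [a]):
            for target in members.get(b, [b]):
                expanded.add((source, target))
    return expanded


# Hypothetical data: a three-member set interacts with 'star', and one
# member also has a separate activity outside the group.
edges = [("JNK_set", "star"), ("JNK1", "substrate_x")]
members = {"JNK_set": ["JNK1", "JNK2", "JNK3"]}
expanded = expand_meta_entities(edges, members)
```

After expansion, `JNK1` links `star` and `substrate_x` directly, a path that was invisible while `JNK1` and `JNK_set` were separate nodes.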
Figure 3: Multicellular interactions present problems in the absence of a defined cellular frame of reference. (A) shows an example system with cellular locations defined solely with respect to the cytosol, cell membrane and extracellular region of an unspecified cell. This representation generates ambiguity and is misleading when describing multicellular interactions: the same set of reactions can lead to significantly different functional capabilities of the interaction network when this is accurately represented (C). The example in (B) is sourced from Reactome’s ‘Latent infection of H. sapiens with M. tuberculosis’ pathway (REACT_121237.2). In the version of the network described in the database, the ‘cell wall’, ‘periplasmic space’ and ‘plasma membrane’ locations can be assigned to Mycobacterium (green), and ‘phagocytic vesicle membrane’ and ‘late endosome membrane’ to H. sapiens (blue). The more generic ‘cytosol’ is ambiguous (orange), and reactions assigned to this location could belong to either species. Fixing these assignments (using the graphical representation of the pathway) yields the unambiguous representation shown in (D). A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
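The location assignments listed in the Figure 3 caption can be expressed as a lookup that flags entities whose location does not identify an organism. This is a minimal sketch: the location-to-organism mapping follows the caption, while the entity names and the function `qualify_locations` are hypothetical, and the manual correction described in the caption is not automated here.

```python
# Location-to-organism assignments as described in the caption.
LOCATION_OWNER = {
    "cell wall": "Mycobacterium",
    "periplasmic space": "Mycobacterium",
    "plasma membrane": "Mycobacterium",
    "phagocytic vesicle membrane": "H. sapiens",
    "late endosome membrane": "H. sapiens",
    # 'cytosol' is deliberately absent: it could belong to either species.
}


def qualify_locations(entities, location_owner):
    """Split entities into organism-qualified and ambiguous ones.

    entities -- dict mapping an entity name to its annotated location
    Returns (resolved, ambiguous), where resolved maps each entity to an
    (organism, location) pair and ambiguous is a sorted list of names.
    """
    resolved, ambiguous = {}, []
    for entity, location in entities.items():
        owner = location_owner.get(location)
        if owner is None:
            ambiguous.append(entity)
        else:
            resolved[entity] = (owner, location)
    return resolved, sorted(ambiguous)


# Hypothetical entities, for illustration only.
entities = {
    "protein_p": "cell wall",
    "protein_q": "cytosol",
    "protein_r": "late endosome membrane",
}
resolved, ambiguous = qualify_locations(entities, LOCATION_OWNER)
```

Entities returned in `ambiguous` are the ones that would need manual assignment, for example from the pathway diagram.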