Cristina Mitrea, Zeinab Taghavi, Behzad Bokanizad, Samer Hanoudi, Rebecca Tagett, Michele Donato, Călin Voichiţa, Sorin Drăghici.
Abstract
The goal of pathway analysis is to identify the pathways significantly impacted in a given phenotype. Many current methods are based on algorithms that consider pathways as simple gene lists, dramatically under-utilizing the knowledge that such pathways are meant to capture. During the past few years, a plethora of methods claiming to incorporate various aspects of the pathway topology have been proposed. These topology-based methods, sometimes referred to as "third generation," have the potential to better model the phenomena described by pathways. Although there is now a large variety of approaches used for this purpose, no review is currently available to offer guidance for potential users and developers. This review covers 22 such topology-based pathway analysis methods published in the last decade. We compare these methods based on: type of pathways analyzed (e.g., signaling or metabolic), input (subset of genes, all genes, fold changes, gene p-values, etc.), mathematical models, pathway scoring approaches, output (one or more pathway scores, p-values, etc.) and implementation (web-based, standalone, etc.). We identify and discuss challenges, arising both in methodology and in pathway representation, including inconsistent terminology, different data formats, lack of meaningful benchmarks, and the lack of tissue and condition specificity.
Keywords: mathematical model; metabolic pathways; network topology; pathway analysis; signaling pathways; statistical significance; topology
Year: 2013 PMID: 24133454 PMCID: PMC3794382 DOI: 10.3389/fphys.2013.00278
Source DB: PubMed Journal: Front Physiol ISSN: 1664-042X Impact factor: 4.566
Figure 5. Comparison of representative graph models for molecular interactions as used by different pathway databases. In a KEGG signaling pathway (A) nodes represent genes/gene products and edges represent regulatory signals such as activation, inhibition, phosphorylation, etc. (see http://www.genome.jp/kegg/document/help_pathway.html for details). In the chemical network representation of a KEGG metabolic pathway (B) the nodes represent biochemical compounds and edges represent chemical reactions. These chemical reactions are performed by enzymes, which are proteins encoded by genes. Hence, in contrast with signaling pathways, in which genes are associated with nodes, in a metabolic pathway genes are associated with edges. This is the main reason most methods developed for signaling pathways cannot be applied directly to metabolic pathways. In an NCI-PID signaling pathway (C) nodes fall into two categories: component nodes representing biomolecular components, and process nodes representing biochemical reactions or biological processes. Edges connect two biomolecular components through a biochemical reaction or a biological process. Process nodes can have 3 states: positive regulation, negative regulation, or “involved in” (see http://pid.nci.nih.gov/userguide/network_maps.shtml for details). In a protein-protein interaction network (D) nodes represent proteins and the interactions among them represent physical binding. These interactions can be inferred from two-hybrid assays and they may be either undirected (top), or directed from the bait protein to the prey protein (bottom). In the Biological Pathway Exchange (BioPAX) (E) nodes are physical entities and edges are conversions. BioPAX entities can represent complexes, DNA, proteins, RNA, small molecules, DNA regions or RNA regions. Conversions can represent biochemical reactions, complex assembly or degradation, transport, or transport with biochemical reaction. This model is very generic and increasingly flexible. It provides a standard for making pathway information available in a machine-readable format, which makes it easy to use for pathway analysis and to exchange between pathway databases (see http://www.biopax.org/release/biopax-level3-documentation.pdf for details).
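The difference between genes on nodes (signaling) and genes on edges (metabolic) can be illustrated with a minimal Python sketch. The gene and compound names, the tuple-based edge layout, and the helper functions are all hypothetical placeholders chosen for illustration; they do not reproduce any database's actual format.

```python
# Signaling-style graph: genes are nodes, edges carry a regulatory relation.
# All names below are illustrative placeholders, not real pathway content.
signaling_edges = [
    ("GENE_A", "GENE_B", "activation"),
    ("GENE_B", "GENE_C", "inhibition"),
]

# Metabolic-style graph: compounds are nodes, each edge is a reaction
# annotated with the gene(s) encoding the enzyme that catalyzes it.
metabolic_edges = [
    ("compound_1", "compound_2", {"GENE_X"}),
    ("compound_2", "compound_3", {"GENE_Y", "GENE_Z"}),
]


def genes_on_nodes(edges):
    """Collect the node labels of a graph (genes, if nodes are genes)."""
    return {node for src, dst, _ in edges for node in (src, dst)}


def genes_on_edges(edges):
    """Collect genes from a graph whose edges are annotated with gene sets."""
    return set().union(*(genes for _, _, genes in edges))


signaling_genes = genes_on_nodes(signaling_edges)
metabolic_genes = genes_on_edges(metabolic_edges)

# A node-based method applied to the metabolic graph sees compounds, not genes.
metabolic_nodes = genes_on_nodes(metabolic_edges)
```

Under these assumptions, a method that reads expression values from nodes finds only compounds in the metabolic graph, which is one way to see why such a method would need adapting before it could be applied there.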
Figure 1. (A) shows a small part of the MAPK signaling pathway from KEGG. This pathway shows the location of various genes or gene products (inside the cell, outside of it, or in the membrane), which genes interact with which other gene(s), the type of each interaction (activation, repression, phosphorylation, etc.), the direction of the signal propagation, and potentially many other things (e.g., complex formation, etc.). (B) presents the same part of the same pathway as a gene set (no interactions). The gene set has lost all the structure and the additional information captured by the original pathway. This comparison shows how much important knowledge present in pathway databases is ignored when pathways are treated as simple gene sets.
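As a rough illustration of the information lost when a pathway is reduced to a gene set, the sketch below builds two toy pathways with different interactions and directions over the same genes. The gene names and edge lists are invented for this example and are not taken from the MAPK pathway.

```python
# Two hypothetical pathways over the same three genes, wired differently.
pathway_1 = [("G1", "G2", "activation"), ("G2", "G3", "activation")]
pathway_2 = [("G3", "G1", "repression"), ("G1", "G2", "phosphorylation")]


def to_gene_set(edges):
    """Discard interactions and keep only the member genes."""
    return frozenset(gene for src, dst, _ in edges for gene in (src, dst))


# The gene sets are identical even though the topologies differ.
same_gene_set = to_gene_set(pathway_1) == to_gene_set(pathway_2)
same_topology = set(pathway_1) == set(pathway_2)
```

Here `same_gene_set` is true while `same_topology` is false, so any analysis that sees only the gene set cannot distinguish the two pathways.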
Figure 2. Generalized overview of the data flow in pathway analysis methods. For each module, the various options available for different methods surveyed, as well as the comparison criteria used in this paper, are presented in the white boxes.
Figure 3. Timeline showing when the surveyed pathway analysis tools, working mainly with signaling pathways, became available (this time may differ from the publication year shown in the table comparing input criteria). Some of the methods use additional interaction information that may come from an in-house or public gene/protein interaction knowledge base. The acronyms BAPA-IGGFD (Zhao et al., 2012) and TBScore (Ibrahim et al., 2012) were assigned to the respective methods in this manuscript for ease of reference. The commercial tools Pathway-Guide and MetaCore are not included in this figure.
Figure 4. Timeline showing the availability of pathway analysis tools that work mainly with metabolic pathways.
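Comparison of topology-based pathway analysis methods based on different criteria related to the input.
| Method | Experiment input | Interaction network type and database name | Year | References |
| --- | --- | --- | --- | --- |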
| ScorePAGE | All genes expression | KEGG metabolic | 2004 | Rahnenführer et al., |
| MetaCore | DE genes list | Literature-based genome-scale interaction network; proprietary canonical pathway, genome-scale network | 2004 | N/A |
| Pathway-Express | DE genes with values, All genes expression | KEGG signaling | 2005 | Khatri et al., |
| TAPPA | All genes expression | KEGG metabolic | 2007 | Gao and Wang, |
| PathOlogist | All genes expression | KEGG | 2007 | Efroni et al., |
| Pathway-Guide | DE genes with fold change (FC) values, DE genes list, All genes with values, DE genes with FCs and | KEGG signaling, REACTOME, NCI, BioCarta | 2009 | N/A |
| SPIA | DE genes with values | KEGG signaling | 2009 | Tarca et al., |
| NetGSA | All genes expression | KEGG signaling | 2009 | Shojaie and Michailidis, |
| PWEA | All genes expression | YeastNet | 2010 | Hung et al., |
| TopoGSA | DE genes list | Genome-scale PPI network, KEGG | 2010 | Glaab et al., |
| PARADIGM | All genes expression, copy number, proteins level | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | 2010 | Vaske et al., |
| TopologyGSA | All genes expression | NCI-PID | 2010 | Massa et al., |
| DEGraph | All genes expression | KEGG | 2010 | Jacob et al., |
| MetPA | DE metabolites with values | KEGG metabolic | 2010 | Xia and Wishart, |
| BPA | All genes expression - with cut-off | NCI-PID | 2011 | Isci et al., |
| GANPA | DE genes with values, All genes expression | Genome-scale PPI network, KEGG, REACTOME, NCI-PID, HumanCyc | 2011 | Fang et al., |
| BAPA-IGGFD | All genes expression - with cut-off | Literature-based gene-gene interaction database, KEGG, WikiPathways, REACTOME, MSigDB, GO BP, PANTHER; constructed gene association network from PPIs; co-annotation in GO Biological Process (BP); and co-expression in microarray data | 2012 | Zhao et al., |
| CePa | DE genes list / All genes expression | NCI-PID | 2012 | Gu et al., |
| THINK-Back-DS | DE genes with values, All genes expression | KEGG, PANTHER, BioCarta, REACTOME, GenMAPP | 2012 | Farfán et al., |
| TBScore | DE genes with values | KEGG signaling | 2012 | Ibrahim et al., |
| ACST | All genes expression | KEGG signaling | 2012 | Mieczkowski et al., |
| EnrichNet | DE genes list | Genome-scale PPI network, KEGG, BioCarta, WikiPathways, REACTOME, NCI-PID, InterPro, GO with STRING 9.0 | 2012 | Glaab et al., |
commercial methods;
released in 2013 as part of ROntoTools (http://www.bioconductor.org/packages/release/bioc/html/ROntoTools.html).
N/A, No publication available. Experiment input describes the type of experiment data input required by the method. The meaning of each term is as follows: “DE genes with values/DE metabolites with values” represents the list of differentially expressed (DE) genes or metabolites with their fold-change value or t-statistics. Sometimes this list is accompanied by the list of total genes monitored in the experiment; “DE genes list” represents a list of selected genes, usually DE genes (this is just a list of IDs, without associated fold-changes). “All genes expression” represents the list of all genes in all samples together with their expression values. Some methods require all genes, but then perform the analysis using a flag for the DE genes - these are marked as “with cut-off.” Some methods use one type of input in a gene weighting stage while using another type of input to assess the pathway significance. Interaction network type and database name is the input knowledge source for the analysis method and the databases proposed by the software. Some of the methods can use any pathway, but provide parsed data for the pathway databases listed here. “Pathway” refers to any kind of signaling or metabolic pathway or gene regulatory network. “Genome scale interaction network” refers to interaction networks constructed from protein interactions or co-annotation from GO databases, literature, or co-expression inferred from existing microarray experiments. “Constructed network” means that the analysis method uses pathways created by its authors rather than pathways from a reference database. Year denotes the year of the first published paper describing the method. References denotes the first published paper describing the method.
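The input types defined above can be sketched in Python as follows. The per-gene statistics, the cut-off values, and the helper `is_de` are hypothetical choices made for this example; the full per-sample expression matrix behind “All genes expression” is not modeled, only per-gene summaries.

```python
# Hypothetical per-gene summary statistics: (log2 fold change, p-value).
stats = {
    "G1": (2.1, 0.001),
    "G2": (-0.2, 0.60),
    "G3": (-1.8, 0.004),
    "G4": (0.4, 0.30),
}

# Illustrative thresholds; real analyses choose these per experiment.
P_CUTOFF = 0.05
FC_CUTOFF = 1.0


def is_de(log_fc, p_value):
    """Flag a gene as differentially expressed under the assumed cut-offs."""
    return p_value < P_CUTOFF and abs(log_fc) >= FC_CUTOFF


# "DE genes list": identifiers only, without associated values.
de_genes_list = sorted(g for g, (fc, p) in stats.items() if is_de(fc, p))

# "DE genes with values": DE genes together with their fold changes.
de_genes_with_values = {g: fc for g, (fc, p) in stats.items() if is_de(fc, p)}

# "with cut-off": every gene is kept, with a flag marking the DE ones.
all_genes_with_cutoff = {g: (fc, is_de(fc, p)) for g, (fc, p) in stats.items()}
```

The three derived structures differ only in how much of the measured data they retain, which is the distinction the table's “Experiment input” column records.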
Figure 6. Comparison of the mathematical models of the surveyed pathway analysis methods. “Aggregate scoring” and “Weighted gene set” panels show methods that perform node-level scoring followed by pathway-level scoring performed either as an aggregation of the node scores or as a weighted gene set analysis, using the node scores as weights. The methods are divided according to their node-level scoring methods: graph measure techniques, similarity measurement techniques, probabilistic models, or using normalized node values based on node value and/or pathway structure. The “Multivariate scoring” methods use multivariate scoring models without node-level scoring. They use node values to directly compute a pathway score using Bayesian networks or applying multivariate hypothesis tests.
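The two-stage idea of node-level scoring followed by aggregation can be sketched as below. This is a toy illustration only: the weighting by out-degree, the use of absolute log fold change, and the summation are assumptions made for the example and do not reproduce the scoring formula of any surveyed method.

```python
# Hypothetical directed pathway and per-gene log fold changes.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
log_fc = {"A": 1.0, "B": -2.0, "C": 0.5}


def out_degree(edges):
    """Count outgoing edges per node (a simple graph measure)."""
    degree = {}
    for src, dst in edges:
        degree[src] = degree.get(src, 0) + 1
        degree.setdefault(dst, 0)
    return degree


def node_scores(edges, log_fc):
    """Node-level score: |log fold change| weighted by (1 + out-degree)."""
    degree = out_degree(edges)
    return {g: abs(log_fc.get(g, 0.0)) * (1 + d) for g, d in degree.items()}


def pathway_score(edges, log_fc):
    """Pathway-level score: plain sum of the node-level scores."""
    return sum(node_scores(edges, log_fc).values())


scores = node_scores(edges, log_fc)
total = pathway_score(edges, log_fc)
```

In this sketch a gene with many downstream targets contributes more to the pathway score than a leaf gene with the same expression change, which is the general intuition behind graph-measure weighting.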
Figure 7. Diagram of the pathway analysis scoring approach for hierarchically aggregated scoring algorithms. The box with the dashed border indicates options that the user can choose but that are not offered by the method implementation.
Figure 8. Diagram of pathway analysis scoring approaches for multivariate scoring algorithms.
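Comparison of topology-based pathway analysis methods using criteria related to the mathematical model and implementation.
| Method | Graph model | Scoring method | Implementation | License | Language | Available from |
| --- | --- | --- | --- | --- | --- | --- |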
| ScorePAGE | Single-type, undirected | Hierarchical, similarity | Standalone | N/A | R | on demand |
| MetaCore | Single-type, directed | Hierarchical, graph measures | Web-based, Standalone | Thomson Reuters | Java | Reuters, |
| Pathway-Express | Single-type, directed | Hierarchical, graph measures | Web-based, Standalone | free | Java, R | Drăghici et al., |
| TAPPA | Single-type, undirected | Hierarchical, NNV | Standalone | N/A | Java | N/A |
| PathOlogist | Multi-type, directed | Hierarchical, probability | Standalone | CC-BY | MATLAB | Greenblum et al., |
| Pathway-Guide | Single-type, directed | Hierarchical, graph measures | Standalone | Advaita Corporation, 2013 | Java | Advaita Corporation, |
| SPIA | Single-type, directed | Hierarchical, graph measures | Standalone | GPL (>=2) | R | Tarca et al., |
| NetGSA | Single-type, directed | Multivariate, hypothesis test | Standalone | GPL-2 | R | Shojaie, |
| PWEA | Single-type, undirected | Hierarchical, similarity | Standalone | free | C++ | Hung, |
| TopoGSA | Single-type, undirected | Hierarchical, graph measures | Web-based | free | PHP, R | Glaab et al., |
| PARADIGM | Multi-type, directed | Hierarchical, probability | Web-based, Standalone | free | C | Vaske and Benz, |
| TopologyGSA | Single-type, moral undirected | Multivariate, hypothesis test | Standalone | AGPL-3 | R | Massa and Sales, |
| DEGraph | Single-type, undirected | Multivariate, hypothesis test | Standalone | GPL-3 | R | Jacob et al., |
| MetPA | Single-type, directed | Hierarchical, graph measures | Web-based | free | PHP, R | Xia, |
| BPA | Single-type, DAG | Multivariate, Bayesian network | Standalone | free | MATLAB | Isci, |
| GANPA | Single-type, undirected | Hierarchical, graph measures | Standalone | GPL-2 | R | Fang et al., |
| BAPA-IGGFD | Single-type, DAG | Multivariate, Bayesian network | Standalone | N/A | R | N/A |
| CePa | Single-type, directed | Hierarchical, graph measures | Web-based, Standalone | GPL (>= 2) | R | Gu, |
| THINK-Back-DS | Single-type, directed | Hierarchical, graph measures | Web-based, Standalone | free | Java | Farfán et al., |
| TBScore | Single-type, directed | Hierarchical, normalized node value (NNV) | N/A | N/A | N/A | N/A |
| ACST | Single-type, directed | Hierarchical, NNV | Standalone | CC-BY | R | Mieczkowski et al., |
| EnrichNet | Single-type, undirected | Hierarchical, graph measures | Web-based | free | PHP | Glaab, |
commercial methods;
free for academic and non-commercial use; UCSC-CGB – the University of California Santa Cruz Cancer Genome Browser;
N/A, no publicly available implementation. Graph model indicates whether the graph, once remodeled to suit the scoring method, is single-type or multi-type and whether it is directed or undirected. DAG stands for directed acyclic graph. The moral graph is described in Section 3.1. Scoring method describes the mathematical model used in the analysis to score nodes and graphs. A detailed description is presented in Section 3.2. Implementation indicates the existence of a standalone or web-based implementation of the method. License represents the license under which the software is available. GPL - GNU General Public License, AGPL - GNU Affero General Public License, CC-BY - Creative Commons Attribution license. Language represents the programming language used for the implementation. Available from points to the paper or URL associated with the given tool.
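For readers without access to Section 3.1, the sketch below shows the standard definition of moralization from graphical models: every pair of parents sharing a child is connected, and edge directions are then dropped. The function name `moralize` and the toy DAG are assumptions made for this example; this is not the implementation used by any tool in the table.

```python
from itertools import combinations


def moralize(dag_edges):
    """Return the moral graph of a DAG as a set of undirected edges.

    Each undirected edge is represented as a frozenset of two node labels.
    """
    parents = {}
    undirected = set()
    for src, dst in dag_edges:
        parents.setdefault(dst, set()).add(src)
        # Drop the direction of every original edge.
        undirected.add(frozenset((src, dst)))
    # "Marry" every pair of parents that share a child.
    for shared_parents in parents.values():
        for a, b in combinations(sorted(shared_parents), 2):
            undirected.add(frozenset((a, b)))
    return undirected


# Hypothetical DAG: A and B are both parents of C, and C is the parent of D.
dag = [("A", "C"), ("B", "C"), ("C", "D")]
moral = moralize(dag)
```

In this toy case the moral graph contains the three original edges without direction plus a new edge between A and B, because they share the child C.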