Literature DB >> 29868818

A comparison of mechanistic signaling pathway activity analysis methods.

Alicia Amadoz¹, Marta R Hidalgo², Cankut Çubuk², José Carbonell-Caballero³, Joaquín Dopazo^2,3,4.

Abstract

Understanding the aspects of cell functionality that account for disease mechanisms or drug modes of action is a main challenge for precision medicine. Classical gene-based approaches ignore the modular nature of most human traits, whereas conventional pathway enrichment approaches produce only illustrative results of limited practical utility. Recently, a family of new methods has emerged that change the focus from the whole pathways to the definition of elementary subpathways within them that have any mechanistic significance and to the study of their activities. Thus, mechanistic pathway activity (MPA) methods constitute a new paradigm that allows recoding poorly informative genomic measurements into cell activity quantitative values and relate them to phenotypes. Here we provide a review on the MPA methods available and explain their contribution to systems medicine approaches for addressing challenges in the diagnostic and treatment of complex diseases.

Entities: Chemical Disease Gene Species

Keywords: disease mechanism; mathematical models; networks; signaling pathways; systems biology; transcriptomics

Mesh：

Year: 2019 PMID： 29868818 PMCID： PMC6917216 DOI： 10.1093/bib/bby040

Source DB: PubMed Journal: Brief Bioinform ISSN： 1467-5463 Impact factor: 11.622

Background

Since the beginning of the century, two technological waves (first microarray and subsequently massive sequencing) have been producing an increasingly vast amount of genomic data on gene variation and gene expression for different phenotypes, diseases and conditions [1]. As these methodologies gain in maturity, different statistical methods were used to derive progressively reliable and robust biomarkers and gene signatures [2] that, despite not being exempt of problems [3, 4], have contributed toward increasing the knowledge on the relationships between genotypes and phenotypes and have become mainstream in clinics [5, 6]. A major drawback from biomarkers and gene signatures is that they do not have an easy interpretation because they frequently lack any mechanistic link to the fundamental cellular processes that account for the phenotype studied. Actually, phenotypic differences are better understood as alterations in the operation of functional modules in the cell, which can be multiprotein complexes, a pathway or a single cellular or subcellular organelle [7]. Such alterations can be caused by different combinations of perturbations (mutations or gene expression changes) of functionally related genes [8, 9]. The interest in defining these biological modules and using them to interpret gene-based analyses resulted in early proposals of functional enrichment methods [10-12]. Such approaches consider functional modules just as plain lists of genes without taking into account either the underlying pathway network topology or the involved functional relations among genes. In the simplest approach, the over-representation analysis (ORA), the fraction of genes belonging to a particular gene set found within a predefined list of genes (e.g. genes showing significant changes in expression) is assessed [13, 14] (Figure 1B and C). A subsequent and more sensitive approach, the gene set enrichment analysis (GSEA), does not need a predefined list of genes but rather searches for functionally related gene sets constitutively up or down within all the genes studied, ranked by a relevance criterion (e.g. differential expression or association to a disease) [15-20]. Typical gene sets used in GSEA are gene ontology [21] categories, but other applications include gene sets sharing common regulation by transcription factors [22] or miRNAs [23], or even assess the impact of regulatory dynamics within functional gene sets, like the ROMA method [24]. However, similarly to ORA, GSEA methods ignore the complex network of functional relations among the proteins that compose the gene sets. Another generation of methods, generically known as pathway topology-based (PT-based) [11] or structure-based [25] methods, constitute different versions of GSEA for structured gene sets (pathways defined in the Kioto Encyclopedia of Genes and Genomes (KEGG) [26], Reactome [27], etc.). Contrarily to ORA and GSEA, these methods take into account the internal relationships among the proteins that compose the pathway (see Figure 1D and 1E for an example of the importance of internal gene–gene connections in the definitions of gene sets). Methods such as SPIA [28], CePa [29], NetGSA [30] or Pathifier [31] use different strategies to produce an enrichment value of the pathway, which is weighted according to the relationships among the activated constituent genes. Even with this approach, assessing and scoring whole pathway activities limits our understanding on the actual functionalities, triggered by specific subpathways of the pathway, carried out by the cell in the conditions studied. Actually, these methods ignore the obvious fact that many pathways are multifunctional entities composed of different subpathways that often may trigger very different and even opposite cell behavioral outcomes, depending on the dynamic progression of the signal propagation between the receptor and the effector proteins. For example, the apoptosis pathway may trigger cell survival or death depending on the specific receptor activated and how the signal is transduced within the pathway to reach a specific effector protein at the end of the signaling circuit. Actually, the most frequently used pathway descriptions (KEGG [26], Reactome [27], Pathway Commons [32], WikiPathways [33], etc.) are abstract concepts built around whole biological notions (e.g. apoptosis or cell cycle) and can be understood as compendiums of different possible cell activities that ultimately account for phenotypes (e.g. disease mechanisms or drug mechanisms of action—MoA). These cell activities are caused by the combined action of different genes that, in turn, are highly pleiotropic and often participate in more than one cell activity. This is the reason why gene activity by itself is often not descriptive of the cell functionality because it depends critically on the activity of other partner genes in the pathway to produce specific cell functionalities, as described in the map of interactions that is the pathway. On the other hand, the activity of the whole pathway does not provide a solution either. Activity measurement at the level of the whole pathway does not have a unique mechanistic consequence, and therefore, its use for the interpretation of genomic or transcriptomic experiments is purely indicative and of limited practical utility. Actually, methods like OCSANA [34] that seek minimal combinations of interventions to disrupt the subnetworks that connect source nodes and target nodes point to substructures within the pathway instead of whole pathways as the relevant entities to target. Moreover, evidences from the experimental side consisting of the prediction of patient survival based on the activity of the JNK subpathway provided a solid demonstration that subpathways can be considered real mechanistic biomarkers of clinical utility [35].

Figure 1.

Schematic representation of the three families of methods: enrichment analysis, PT-based analysis and MPA. The conventional enrichment analysis assumes the existence of a background (A) in which an observed percentage (25% in the example) of the genes differentially expressed (or mutated, associated to a trait, etc.). If gene sets are sampled based on some property shared by all the genes (e.g. they belong to a given pathway), a scenario (B) in which 60% of them are differentially expressed is found; the application of a simple test will evidence that this gene set is significantly enriched in differentially expressed genes, whereas in other scenarios (C), the gene set would not be different from a random sample of genes from the background. A PT-based algorithm takes into consideration the topology of the gene set, and a scenario (C) in which the differentially expressed genes are more connected among them would get a better score that an alternative scenario (D) in which the level of connection of the genes is lower. The significance of this data set would depend on the algorithm that estimates the score and the specific test applied. In MPA, there is more or less specific definition of circuits (subnetworks) within the pathway that should be related to cell activity in some way, and the connectivity of such circuits will determine the potential changes in cell activity. If circuits define subnetworks connecting receptor proteins to effector proteins in a signal transduction pathway, the same number of active genes could allow signal transduction (E) or being incompatible with the arrival of the signal to the current effector proteins (F), even in scenarios that would be significantly enriched in a conventional enrichment method. In an attempt to gain a more precise knowledge on the functional consequences of the activity of pathways, methods that focus on mechanistic pathway activity (MPA) have emerged. MPA methods constitute a new paradigm by providing an alternative way to study the specific activities of internal elementary subpathways or circuits that compose the whole pathway. Circuits are subpathways, defined in different ways by different MPA methods, which are expected to be better descriptors of cell functional activity than whole pathways or single genes. Circuit definitions made by different MPA methods range from simple cliques or strings of connected nodes, that most likely account for unspecific activity within the pathway [36-39], to subpathways with a sound biological foundation (e.g. receptor-to-effector signaling circuits, which are defined from the chain of proteins that connect a specific receptor protein to a specific effector protein at the end of the signaling circuit) that are known to account for specific cell behavioral outcomes [40-43]. Under this mechanistic perspective, what is relevant is not the pure enrichment or a high connectivity by itself, but rather that the observed connectivity within the gene set will be compatible with a biological mechanism of the cell as, for example, signal transduction in a signaling pathway (Figure 1F and G). Most of the methods for computing circuit activity scores use gene expression values [36, 37, 40–42], but also methods that use mutations [44], have been proposed. The main difference with respect to PT-based methods is that MPA methods directly focus on pathway substructures and calculate the corresponding score for them rather than using such substructures to calculate a global score for the whole pathway. MPA methods are likely to gain importance as systems medicine becomes increasingly relevant in understanding complex diseases and helping in their precise diagnosis and therapeutic decisions [45]. Initiatives such as Disease Maps (http://disease-maps.org/) have already provided detailed maps of complex gene relationships among genes in several diseases, such as cancer [46], Parkinson [47], Alzheimer [48] and influenza [49], and eight more are under development. These curated disease maps, along with the classical pathways, such as KEGG [26], Reactome [27], Pathway Commons [32], WikiPathways [33] and others [50], are detailed static diagrams whose dynamics and activity can be understood with MPA methods. Despite the fact that numerous reviews of pathway enrichment methods can be found [10, 51], along with a few reviews on pathway-topology analysis methods [11, 25], to our knowledge, there are no reviews on specific methods that focus on testing independently the activity of pathway substructures (circuits). Here, we present a comprehensive review of MPA methods in which their relative advantages, drawbacks and performances are compared.

Methods

Methods analyzed

Contrarily to different versions of GSEA methods, such as PT-based [11] or structure-based [25] methods, in which pathway structures are taken into account to calculate a topology-weighted score for the whole gene set, MPA methods use pathway topologies to define subnetworks within, based on different criteria, which potentially constitute better descriptors of the real functional activity of the cell. MPA methods differ mainly in the way in which they define circuits within the pathways and in the manner that they estimate the activity of such circuits. A comprehensive collection of the available MPA methods is listed in Table 1. The most pervasively used pathway definition across all the methods is KEGG. Some methods use the KEGG pathway map to define circuits within with a biological rationale (e.g. receptor-to-effector circuits that effectively model signal transduction events), whereas others use the pathway topology to find subgraphs that behave differently among the compared conditions (Table 1). Scoring methods are also diverse, and in some cases, circuit activities can be calculated for individuals and then a test can be performed if a comparison is necessary (e.g. Pathiways [41, 42], HiPathia [40], developed by us, MinePath [52], TAPPA [39]), whereas other methods require of a comparison either to use differentially expressed genes (DEG) or to find subgraphs that explain the comparison. Another interesting aspect is that almost half of the methods do not distinguish between activations and inhibitions in the calculation of the score, despite the availability of this information and its potential importance (Table 1). Since circuit definition and scoring methods used will potentially have an impact on the relative performances of the methods, we have carried out a comparative benchmark to study their relative importance. From the methods listed in Table 1, nine were selected that satisfy three basic conditions: they can be applied to RNA-seq data, they have a common definition of pathway (KEGG pathways constituted the unique common pathway definition) and there is software available for running them.

Table 1.

List of mechanistic pathway activity methods

Method	Date	Code	Pathway modeled	Circuit definition	Scoring method	Activation / inhibition	Input	Result	Scope
Hipathia [40]	2017	Web application	KEGG	Receptor-to-effector circuits	Propagation algorithm	Yes	MA, RNA-seq	P-value per circuit	Multiple analyses
		http://hipathia.babelomics.org
		Hipathia R code
MinePath [52]	2016	Web application	KEGG	All possible circuits	Discretized gene expression (GE) values with logical operators	Yes	MA RNA-seq	P-value per circuit	Multiple analyses
MinePath [52]	2016	http://minepath.org/	KEGG	All possible circuits		Yes	MA RNA-seq	P-value per circuit	Multiple analyses
subSPIA [53]	2015	R code	KEGG	Minimal spanning trees (MST)	Differentially expressed (DE) genes used to define the MST	No	MA RNA-seq	P-value of DE per circuit	Comparison
PathiVar [44]	2015	Web application	KEGG	Receptor-to-effector circuits	Probabilistic model	Yes	MA, VCF	P-value per circuit	Multiple analyses
PathiVar [44]	2015	http://pathivar.babelomics.org	KEGG	Receptor-to-effector circuits	Probabilistic model	Yes	MA, VCF	P-value per circuit	Multiple analyses
Pathome [54]	2014	NA	KEGG	Receptor-to-effector linear circuits	Correspondence between pattern activation/inhibition and co-expression in adjacent gene pairs	Yes	MA RNA-seq	P-value per circuit	Comparison
Pathiways [41, 42]	2013	Web application http://pathiways.babelomics.org	KEGG	Receptor-to-effector circuits	Probabilistic model	Yes	MA	P-value per circuit	Multiple analyses
DEAP [55]	2013	Python code	KEGG	Receptor-to-effector linear circuits	Running sum of discretized DE	Yes	MA RNA-seq	Maximally differential expressed pathway	Comparison
CLIPPER [37] and GraphiteWeb [56]	2013	R package	KEGG; Reactome	All possible circuits	Weighted sum of GE	No	MA RNA-seq	Most relevant circuit per pathway	Comparison
		ToPASeq R package
		Web application:
		http://graphiteweb.bio.unipd.it/
TEAK [57]	2013	Code @ Google (Windows and Mac)	KEGG	Receptor-to-effector circuits	Fits a Bayesian network for circuit and uses the BIC	No	MA	Ranked circuits	Comparison
PRS [58]	2012	ToPASeq R package	KEGG	Trees of associated DE genes	Topologically weighted sum of DE	No	MA RNA-seq	Ranked subpathways	Comparison
DEGraph [36]	2012	R package	KEGG; User defined pathways	All possible circuits	Multivariate two-sample tests of means of DE genes within a subgraph.	No	MA RNA-seq	P-value of DE per circuit	Comparison
DEGraph [36]	2012	ToPASeq R package	KEGG; User defined pathways	All possible circuits		No	MA RNA-seq	P-value of DE per circuit	Comparison
Rivera et al. [59]	2012	NA	NetPath	All possible circuits	Weighted Z-score of genes within the subgraph	No	MA	P-value of most perturbed circuit	Comparison
Chen et al. [60]	2011	NA	KEGG	Receptor-to-effector circuits	Euclidian distance	No	MA	P-value per circuit	Comparison
PWEA [61]	2010	ToPASeq R package	User defined pathways	All possible circuits	Mutual influence among gene expression within the circuit	No	MA RNA-seq	P-value of DE per pathway	Comparison
TopologyGSA [38]	2010	ToPASeq R package	User defined pathways	All possible circuits	Comparison of covariance matrices of genes in the circuit	Yes	MA RNA-seq	P-value of DE per pathway	Comparison
DEGAS [62]	2010	Java (Windows)	KEGG	All possible circuits	Heuristic to find the largest dysregulated circuit	No	MA	One circuit per pathway	Comparison
TAPPA [39]	2007	ToPASeq R package	KEGG	All possible circuits	Scores of co-expression that explain the compared conditions	No	MA RNA-seq	P-value of DE per pathway	Multiple analyses

The first column (Method) contains the name or acronym of the method, if exists, otherwise, we refer to it as the first author of the publication. The second column (Date) contains the publication date. The third column (Code) informs on the availability of the code to run the method. The fourth column (Pathway modeled) indicates the database used for pathway definition used by the method. The fifth column (Circuit definition) is the type of circuit used by the method. The sixth column (scoring method) summarizes how the circuit activity is inferred in the method. The seventh column (Activation/inhibition) denotes whether the scoring method uses the information of activation or inhibition nodes. The eight column (Input) indicates the data type that inputs the method (MA: expression microarray; RNA-seq: counts of RNA-seq experiments; VCF: mutation files). The ninth column (Result) describes the results provided by the method. And the tenth column (Scope) indicates the type of analyses the method permits, which can be either only conventional two conditions comparison or a wide range of analyses if the method first recodes gene expression into circuit activities.

List of mechanistic pathway activity methods The first column (Method) contains the name or acronym of the method, if exists, otherwise, we refer to it as the first author of the publication. The second column (Date) contains the publication date. The third column (Code) informs on the availability of the code to run the method. The fourth column (Pathway modeled) indicates the database used for pathway definition used by the method. The fifth column (Circuit definition) is the type of circuit used by the method. The sixth column (scoring method) summarizes how the circuit activity is inferred in the method. The seventh column (Activation/inhibition) denotes whether the scoring method uses the information of activation or inhibition nodes. The eight column (Input) indicates the data type that inputs the method (MA: expression microarray; RNA-seq: counts of RNA-seq experiments; VCF: mutation files). The ninth column (Result) describes the results provided by the method. And the tenth column (Scope) indicates the type of analyses the method permits, which can be either only conventional two conditions comparison or a wide range of analyses if the method first recodes gene expression into circuit activities.

MPA algorithms

TAPPA [39] was the first MPA method proposed by 2007. Although originally proposed as a PT-based algorithm, its implementation in the ToPASeq package [63] allows using it on not only whole pathways but also subpathways. The method is based on the concept of molecular connectivity, a well-known topological descriptor of chemical compounds in chemioinformatics [64]. In essence, the method calculates a score of gene co-expression, taking into account the adjacency in the topology of the pathway. Here, circuits are defined as all the possible subnetworks in the pathway. This method does not take into account activation/inhibition activities of the genes. Circuits with a molecular connectivity associated to the conditions compared by means of a Mann–Whitney test (for binary traits) or a Spearman correlation (for continuous traits) are considered to have a significant differential activity. Significant circuits that do not connect receptors to effector proteins are most probably not involved in triggering changes of cell functionality and consequently they will not be relevant to explain changes in cell behavior. PWEA [61] was proposed in 2010. Similarly to TAPPA, this method was originally proposed as a PT-based algorithm, but the ToPASeq [63] implementation allows using it on subpathways. The method calculates for each gene its mutual influence with respect to the rest of genes within the pathway (or subpathway), obtained as a function of the correlation in their expression values. Then, for each pathway (or subpathway), it calculates a cumulative function based on all the genes within. The significance of this cumulative function is tested by means of a permutation test to compare the intrapathway with respect to the background (out of the pathway cumulative mutual influence). In these calculations, activation or inhibition gene activities are ignored. In the ToPASeq implementation, the circuits made out of KEGG pathways were defined as any subnetwork within the pathways, which again can define circuits not linked to the activation of cell functions. CLIPPER [37], published in 2012, is a generalization of TopologyGSA [38]. The first step of the method converts pathways into gene–gene networks. Given that CLIPPER method only works on acyclic graphs, self-loops and cycles are solved by removing the weakest edge of the cycle based on expression data (with minimum expression profile correlation between nodes [65]). Then, the circuits are defined as receptor-to-effector subnetworks. Gene expression values are assigned to the nodes of the circuits, and the algorithm compares portions (any possible subnetwork) of the circuit between the conditions studied, with the aim of identifying subgroups of genes that appear to drive differences (deregulations) between them. CLIPPER does not distinguish between activation and inhibition activities in the proteins of the pathway. If the portion found as significant does totally connect receptors to effector proteins, the subnetworks detected could not have any relevance in terms of cell functionality. PRS [58] was proposed in 2012. The algorithm maps discretized values of differential gene expression (a threshold of twofold is taken as differential expression and value is set to 1, otherwise to 0, without any further statistical assessment) at the nodes of the gene–gene directed network extracted from KEGG pathways, without considering their activation or inhibition roles. Then, the network is traversed with a depth-first search algorithm that essentially seeks for subnetworks of connected differentially expressed nodes by means of a normalized weighted sum of node values. A permutation test is used to assess the reliability of the subnetworks found. These significant subnetworks constitute the circuits defined by PRS, which, as in CLIPPER, could lack any mechanistic link with changes in cell functionality, if they do not connect receptors to effector proteins. DEGraph [36] was proposed in 2012. This is a complex statistical approach in which first the topology of the graph representing the pathway undergoes Fourier analysis in which some additional information based on gene expression correlations is incorporated. Then, the method performs a Hotelling’s test in the graph–Fourier space for discovering non-homogeneous subgraphs within the whole pathway topology that exhibit a significant shift in means, by means of a pruning approach to reduce testing time and multiple testing. The way in which the algorithm defines circuits as significant subnets, extracted from the analysis of the graph–Fourier space, does not take into account activation/inhibition gene activities and do not specifically focus on circuits that can mechanistically explain cell activity. DEAP [55] was proposed in 2013. The algorithm superimposes expression data onto the pathway graph. Every possible path that can be formed within the graph is independently examined by using a recursive function that calculates the differential expression for each path by adding or subtracting the discretized differential expression value of all downstream nodes with catalytic or inhibitory relationships, respectively. DEAP focuses on discovering the most differentially expressed portion of the pathway. Again, from a pure mechanistic point of view, it could be argued that the activity of some portions of the pathway does not necessarily involve a functional activity within the pathway. SubSPIA [53] was proposed in 2015. SubSPIA maps the DEG into the pathway structure. Then, the method tries to connect all the possible DEG using the minimum possible number of non-DEGs by using a minimum spanning tree strategy. Activations and inhibitions are not taken into account. Circuits defined in this way do not necessarily involve a functional activity within the pathway. MinePath [63] was proposed in 2016. The method first discretizes the expression values of the genes to active/inactive categories. Then, each pathway is decomposed in all the possible subgraphs, over which the gene expression categories are mapped. The functional status of any subgraphs is assessed by the Boolean algorithm that takes into account the activity of the involved genes and the type of relationship among them (activations and inhibitions). When the activity status changes between the conditions are compared, the subgraph is considered to be a differentially activated circuit. Again, the circuits found under this strategy do not necessarily involve a functional activity within the pathway. HiPathia [40] was proposed by us in 2017. The method defines circuits as subgraphs that connect receptor proteins to effector proteins in the pathways. These types of circuits represent the logic of signal transduction in the cell and are related to the mechanistic of cell functionality. HiPathia uses the normalized gene expression values as proxies of protein activity and, taking into account the relationships among proteins (activation/inhibition) along the pathway, it uses a propagation algorithm to estimate the amount of signal that arrived to the effector protein from the receptor protein. Then, the profiles of circuit activity are compared by means of a conventional Wilcoxon test.

Data source

A total of 6246 samples, including 5647 tumor samples and 598 samples taken from the healthy reference tissue, belonging to 12 cancer types from The Cancer Genome Atlas (TCGA) data portal (https://tcga-data.nci.nih.gov/tcga/) were used for the benchmarking. The following cancers were used: bladder urothelial carcinoma [66], breast invasive carcinoma [67], colon adenocarcinoma [68], head and neck squamous cell carcinoma [69], kidney renal clear cell carcinoma [70], kidney renal papillary cell carcinoma [71], liver hepatocellular carcinoma, lung adenocarcinoma [72], lung squamous cell carcinoma [73], prostate adenocarcinoma [74], thyroid carcinoma [75] and uterine corpus endometrial carcinoma [76]. Fourteen KEGG [26] pathways belonging to the subcategory of ‘Pathways in cancer’ were used to detect changes when cancers versus control comparisons were made. These are MAPK signaling pathway (hsa04010), Wnt signaling pathway (hsa04310), TGF-beta signaling pathway (hsa04350), VEGF signaling pathway (hsa04370), Jak-STAT signaling pathway (hsa04630), cAMP signaling pathway (hsa04024), PI3K-Akt signaling pathway (hsa04151), mTOR signaling pathway (hsa04150), cell cycle (hsa04110), apoptosis (hsa04210), p53 signaling pathway (hsa04115), focal adhesion (hsa04510), adherens junction (hsa04520) and PPAR signaling pathway (hsa03320).

Data processing

The COMBAT method [77] was used to remove non-biological experimental variations (batch effect) associated to a Genome Characterization Center (GCC) and plate ID from the RNA-seq data. Then, the trimmed mean of M-values normalization method (TMM) method [78] was used for data normalization. The resulting normalized values were used as input for the MPA methods.

Sensitivity of the methods

To estimate sensitivity or the true positive rate (TPR), we used the 12 cancer types mentioned earlier. In any of the 12 normal versus tumor comparison, we expect to detect changes of activity in the 14 cancer KEGG pathways. As different methods have different circuit definitions, the comparison of the methods was carried out at the level of the whole pathway definition. This is a quite liberal estimation because the detection of activity within a pathway by two methods does not mean that the activity detected really corresponds to the same cell behavioral outcome. However, possible overoptimistic TPRs obtained by this approximation will be compensated by low specificity determinations. Therefore, for each method, we estimated a rate of true positive as the number of cancer pathways in which the method discovered one or more differentially activated circuits divided by 12, the total number of cancer pathways. The distributions were ranked by TPR value and compared by means of a Wilcoxon test with Bonferroni correction to detect significant differences among them.

Specificity of the methods

The false positive rate (FPR) is estimated by comparing two groups composed by identical individuals for discovering differentially activated circuits in any of the cancer KEGG pathways. As the individuals compared belong to the same cancer type, any differentially activated circuit would be a false positive detection. For each comparison, we used two data sets of 25 tumor samples, each randomly chosen from each of the cancer types analyzed. Comparisons between both data sets were repeated 100 times per method and cancer. Similarly to the case of TPR, the FPR of any method is calculated as the number of cancer KEGG pathways in which the method finds one or more circuits significantly activated divided by the total number of pathways analyzed (14 pathways). Again, the distributions were ranked by FPR value and compared by means of a Wilcoxon test with Bonferroni correction to detect significant differences among them. If a method has a too liberal detection of active pathways, its high sensibility will be compensated with a low specificity.

Results

Data pre-processing

A large data set of RNA-seq counts for 12 cancer types analyzed was used for the benchmark provided in this review. The data were downloaded from TCGA data portal (https://tcga-data.nci.nih.gov/tcga/). The data were normalized using TMM [78] to account for RNA composition bias. Batch effect was corrected by the application of the COMBAT [77] method. Normalized data were used as input for the MPA methods.

Estimation of the sensitivity of the MPA methods

To estimate the relative TPR of the MPA methods analyzed, we compared cancer samples versus the corresponding healthy tissue for the 12 cancer types analyzed. In all these cancer versus control comparisons, statistically significant differences in the KEGG cancer-associated pathways are expected. Figure 2 shows how only HiPathia was able to detect changes in activity in circuits belonging to all the cancer pathways analyzed across the 12 cancer types analyzed. Other group of three methods (TAPPA, DEGraph and subSPIA), with a significantly different performance (P-value = 3 × 10−4), was able to detect between 50% and 75% of the cancer pathways used here. The rest of methods detected differential activity in less than 50% of the cancer pathways.

Figure 2.

TPR or sensitivity was computed as the number of significant cancer pathways found, when cancer samples are compared with samples of the tissue of reference, divided by the total number of cancer pathways (14 for HiPathia and DEAP and 13 for the rest of methods, because PPAR signaling pathway [hsa03320] was not implemented in them) per method and cancer. Violin plots obtained using 12 cancer types show for any method the mean TPR in the central dot, all possible results, with thickness indicating how common, in the outer shape and the layer inside, represents the values that occur 95% of the time. The figure shows the methods ranked by TPR value. A Wilcoxon test with Bonferroni correction was used to compare successive TPR distributions to detect significant differences among them. Black lines denote significant differences between consecutive methods. Brackets define groups of methods with no significant differences in their performances.

Estimation of the specificity of the MPA methods

To check whether the high sensitivity of HiPathia, TAPPA, DEGraph and subSPIA is real or is only the consequence of a low specificity, we estimated the FPR for them. To achieve this, data sets of identical samples were compared and significant differences in circuit activity found by a particular method in the comparisons are considered false positives. A total of 10,800 comparisons (100 times × 9 methods × 12 cancer types) between pairs of data sets of 25 samples each, randomly sampled among cancer samples, were performed. The FPR was computed as the mean of the number of significant cancer pathways divided by the total number of cancer pathways per method and cancer. Figure 3 shows how most of the methods have a low FPR (below the conventional 5% P-value), except PWEA, which displays a high ratio of false positives (over 30%). The best performer is SubSPIA (P-value = 0.005), which, together with CLIPPER and HiPathia methods, showed the highest specificity (P-value = 0.001).

Figure 3.

FPR or specificity was computed as the mean of the number of significant cancer pathways found, when cancer samples are compared with cancer samples, divided by the total number of KEGG cancer pathways along 100 bootstraps, per method and cancer. Violin plots show average values and distributions of the proportions of false discoveries made by any method. The figure shows the methods ranked by FPR value. A Wilcoxon test with Bonferroni correction was used to compare successive FPR distributions to detect significant differences among them. Black lines denote significant differences between consecutive methods. Brackets define groups of methods with no significant differences in their performances.

An example of MPA analysis of death and postmortem cold ischemia of blood transcriptome

Beyond the typical cancer study applications showed in all the MPA method articles, other biological systems can be studied at an unprecedented functional detail with the application of MPA methods. In a recent article [79], we studied the changes of blood transcriptome after death and subsequent postmortem cold ischemia and the functional consequences using the data available in the GTEx project [80]. Although changes in the transcriptome are expected as a response to the death, little is known about how death and the postmortem cold ischemia interval affect gene expression [81, 82]. In principle, gene expression measurements in postmortem samples will be affected both by biological responses to organism death and RNA degradation. MPA methods can help to understand the mechanisms of these effects and how they are dependent on the postmortem interval. This knowledge is essential for the proper use of postmortem gene expression measures as a proxy for antemortem physiological gene expression levels [83]. In the study, most of the changes in gene expression occur between 7 and 14 h after death, followed by a relative stabilization, between 14 and 24 h. The use of an enrichment method pointed to several processes affected by the gene expression changes, which include gene ontology terms related to immune response such as response to bacterium, response to other organisms, response to biotic stimulus, regulation of leukocyte migration, regulation of cytokine production and acute-phase response, to DNA replication, such as DNA packaging process, as well as some metabolic processes related to hypoxia, such as reactive oxygen process, peroxidase activity, oxidoreductase activity, carbohydrate binding and response to lipopolysaccharide [79]. All of them are very general processes and most of them can have an opposite outcome depending on the specific activity of the process (e.g. regulation of cytokine production can be positive or negative and there is no way of discovering the sense of the regulation from the pure enrichment of this ambiguous term). However, the application of the HiPathia [40] MPA method revealed significant changes in several relevant functional activities. Actually, HiPathia reveal a detailed picture on the ultimate mechanisms that cause the deactivation of the immune system. Regulation of interferon production is inhibited from several circuits of the apoptosis RIG-I-like receptor and the RIG-I-like receptor signaling pathways. Response to interleukin-1 is inhibited from the MAPK signaling pathway. Neutrophil activation is inhibited by a circuit of the Fc epsilon RI signaling pathway, whereas negative activity of natural killer cells is triggered from the natural killer cell-mediated cytotoxicity pathway (Figure 4A). As suggested by enrichment carbohydrate metabolism is affected, but MPA detects the mechanism behind, which involves a severe deactivation of the tricarboxylic acid cycle, while glycolysis is activated (FDR-adjusted P-value <10−27; [79] Figure 4D), and points to a major role in the initial pre- to postmortem transition for the hypoxia process (FDR-adjusted P-value 7.2 × 10−67), by the activation of HIF-1, mTOR, platelet activation and cGMP-PKG signaling pathways [79].

Figure 4.

Schema of the mechanisms behind the deactivation of the immune system (A) and the changes in the metabolism (B) caused by changes of blood transcriptome after death and subsequent postmortem cold ischemia [73]. Finally, activation of processes related to blood coagulation (FDR-adjusted P-value 1.52 × 10−54) and response to stress triggered by specific circuits in NF-kappa B, cAMP and HIF-1 signaling pathways (Figure 4B) are also detected by the MPA method.

Discussion

MPA, a new paradigm for the interpretation of genomic data

Transcriptomic experiments are affordable nowadays and provide a wealth of data that must be interpreted in the light of their biological consequences and implications. Different functional profiling methods have been proposed for this interpretation, such as over-representation methods or gene-set enrichment methods [15-18], that focus on the collective activity of genes within biologically relevant entities such as pathways. However, given that most pathways are multifunctional entities, these methods, even in their most sophisticated versions that consider the internal structure of the pathway, are simply illustrative and fail to provide real mechanistic information on specific behavioral outcomes of the cell. MPA methods provide an innovative, biologically inspired alternative for the interpretation of transcriptomic experiments. These methods use biological knowledge available on the interrelation among the genes that compose the pathways to provide hypothesis on how their perturbations ultimately cause diseases, responses to treatments or other complex phenotypes, such as the effects of death and postmortem cold ischemia on human tissues [78]. Therefore, MPA methods provide a link between gene perturbations (measured as gene expression changes) and disease mechanisms or drug MoAs [83, 84]. To understand how the different pathway definition and scoring strategies used by the different MPA methods capture biological information that account for cell behavior (such as signal transduction circuit activities) and relate them to phenotypic conditions, we have produced a benchmarking of nine MPA methods that could be compared in similar conditions.

Possible factors that determine the relative performance of MPA methods

Figure 5 offers a combined view of both specificity and sensitivity of the methods analyzed in this review. Whereas most of the MPA methods show an excellent specificity (below the typical 5% conventional P-value threshold, with the exception of PWEA), the differences in terms of sensitivity are more pronounced. According to their relative sensitivity and specificity, we could distinguish four groups of methods. The first group of high sensitive and specific methods would include HiPathia, the only one able to detect differences in the activities of all the cancer pathways across all the cancer types analyzed while maintaining a high specificity as well. A second group of methods with medium sensitivity but still high specificity, which includes TAPPA, DEGraph and SubSPIA, detects changes in only 75%–50% of the cases (P-value = 0.08). A third group of methods, with low sensitivity but still high specificity, which includes CLIPPER, DEAP, MinePath and PRS, shows poorer performance, detecting changes in circuit activities in less than 25% of the cases. Finally, the conceptually most different method, PWEA, does not only present a low specificity but also a low sensitivity.

Figure 5.

Simultaneous comparison of sensitivities and specificities of the different MPA methods. The results obtained in the 12 cancers are used to obtain a mean value and an error. The x-axis represents 1 − the FPR. Horizontal bars represent in each point 1 SD of the FPR for the corresponding method. It is difficult to attribute the relative performance of the methods to a unique factor and it rather seems to be a combination of several of them. Apparently, the use of receptor-to-effector definition of circuit and the distinction between activations and deactivations seem to be important factors that differentiate HiPathia from the rest of methods in terms of sensitivity. MinePath and DEAP also use activation/inhibition information to calculate the score and DEAP uses receptor-to-effector circuit definitions, but in both cases, the scoring algorithm uses discretized values of differential gene expression, which seem to reduce drastically the sensitivity. The most representative feature of the second group of methods seems to be the use of differential gene expression or co-expression to obtain scores for circuits. These circuits that can effectively separate the conditions compared are chosen as differentially activated. In the third group, showing poorer sensitivity (below 25%), the discretization of differential gene expression values seems to represent a hurdle for obtaining a better sensitivity for two of the methods (DEAP and MinePath). The case of CLIPPER and PRS is probably related to a combination between the scoring strategy and the circuit definition. Finally, the PWEA presents, in addition, a low specificity. Probably, the PWEA case is a combination of the circuit definition and a scoring algorithm, based on mutual influence among genes, which is not capturing properly the underlying biology of the pathway activity. It must be stated that all the methods included in the three first groups discover differentially activated circuits efficiently and with a low rate of false positives, thus providing a more informative interpretation than the simple description of differential gene expression. Moreover, all the methods in their original publications demonstrated to be more sensitive than the conventional ORA and GSEA methods [36, 37, 39, 40, 52, 53, 55, 58, 61]. Receptor-to-effector subpathways are relevant circuit definitions from a biological standpoint, as they represent the possible routes taken from the beginning of a pathway, where the signal is originated, to its end, where a function is triggered. Within the context of signaling pathways, such circuit definitions effectively model signal transduction events. MPA methods implementing these circuit definitions model more realistically biological events and consequently produce better results. In addition, an interesting feature of the methods that use receptor-to-effector circuits is that the changes in the activity of such circuits can be easily associated to cell functionalities triggered by the effector protein [40]. Contrarily, given the fact that many genes and subnetworks can be shared by several pathways, pathway definitions based in subnetworks are, in principle, more prone to false positives. Most of the MPA methods are based on the use of differential gene expression or on any type of comparison-derived scoring system, which restricts its analytic use only to comparison scenarios. An interesting feature for a MPA method is the possibility of producing circuit values for each individual sample. This allows transforming an uninformative matrix of [samples × gene] expression values into a more informative matrix [samples × circuit activities] that can be used for detecting differentially active circuits, but also for many other types of analysis such as clustering, predictors, time-series analysis, survival analysis, patient stratification, etc. For example, the Pathiways method [41, 42] was used to predict cell response to different drugs [85]. Because most of the methods only accept KEGG pathways as input (Table 1), it is not possible to test potential biases of the different methods under different pathway definitions. In principle, given the way that the methods tested work, the two most important factors prone to cause any bias would be the length of the pathways and the existence of more or less loops. In principle, longer pathways would affect the performance in terms of runtime and, in some cases, could introduce a multiple testing issue in methods that decompose the pathway in all the possible subpathways. TAPPA and MinePath will probably be affected in their runtimes, and methods such as PRS of DEGraph even more and could even be impracticable for large pathway definitions, given the algorithm behind the definition of circuits. On the other hand, CLIPPER accuracy could be compromised by pathways with many loops, given that this algorithm breaks the loops to linearize the circuits.

Mechanistic biomarkers and systems medicine

Despite its unquestionable clinical utility, single-gene biomarkers and gene signatures are not exempt of problems [3, 4] due to the lack of any mechanistic link to the fundamental cellular processes that account for the mechanism of the disease [84]. Today’s empirical methods of diagnosis and therapeutic decision-making need to be transformed in ways that consider important facts such as the heterogeneity of the patients [86] and the modular nature of diseases [7], which are overlooked today. To achieve this transformation, the current method of classifying diseases needs to be modified from a phenotypic description to one that incorporates the different molecular drivers that created the observed phenotype. This can only be achieved through a deeper, systems-based understanding of these disease drivers [84]. Within this context, a mechanistic biomarker would be a multigenic biomarker that uses genes involved in the mechanism of a specific phenotype to estimate an activity value for cell functionalities related to, or directly accounting for, such a mechanism. The prediction of a complex phenotype such as patient survival based on a mechanism that is conceptualized as a mathematical model of the activity of the JNK signaling subpathway represents an example of a mechanistic biomarker [35]. MPA methods open the door to the systematic definition and use of mechanistic biomarkers [35]. Moreover, mechanistic biomarkers are not only diagnostic but also can be actionable entities that can efficiently and realistically address the problem of patient heterogeneity [86]. In fact, the concept of actionability within a systems medicine context is related to the activity of the mechanistic biomarker, rather than with the specific activity of a particular gene, and could require distinct interventions in different patients [35, 40]. Thus, MPA methods can be used to suggest and predict the effect of interventions (Knock Outs (KOs), drug inhibitions, over-expressions, etc.) on specific genes within circuits so as to find suitable clinical targets, predict side effects, speculate off-target activities, etc. This is the case of the PathAct application [87] that uses the HiPathia algorithm [40] to explore the effects of interventions that are simulated by directly changing the expression level of the genes, which can be considered an in silico equivalent to KOs, drug inhibitions or gene over-expressions, and can be carried out as individual interventions or in combinations. The use of actionable mechanistic biomarkers can pave the way for personalized and individualized therapies, especially in cancer, where many targeted therapies are already available. Another obvious area of application is drug discovery [83], where patient heterogeneity is behind the failure of many drugs [88]. In addition, the systems biology perspective provided by MPA approaches can be used to address other biological problems with an unprecedented mechanistic detail. An interesting problem is the assumption that the use of postmortem samples for transcriptomic studies provides a good representation of the antemortem transcriptome [81], despite the fact that changes in gene expression levels are expectable and actually little is known about how death and the postmortem cold ischemia can affect gene expression measurements [81, 82]. A recent study using data available from the GTEx project [80] revealed important changes of blood transcriptome after death and subsequent postmortem cold ischemia with relevant functional consequences such as deactivation of the immune system, metabolic changes as response to hypoxia, DNA synthesis arrest, diverse responses to stress and the activation of blood coagulation [79].

Conclusion

MPA methods constitute an evolution of pathway analysis methods in which pathways are decomposed into elementary subpathways or circuits that potentially account for cell outcomes that can help to explain mechanistic features of phenotypes (disease mechanism, drug MoA, etc.). Here we provide a review of MPA methods that include a limited benchmarking of sensitivity and specificity. From this comparison we concluded that, although most of the methods were highly specific, they presented remarkable differences in terms of sensitivity. From their relative performances it can be concluded that a biologically realistic definition of the circuits within the pathways analyzed is a major determinant of the success of the method. However, the scoring methodology, which accounts for the activity of the circuit, must also be representative of the biological activity of the cell. Thus, the propagation method used by HiPathia seems to be the most efficient solution, followed by scores based on differential gene expression, implemented by subSPIA, DEGraph and TAPPA. In any case, MPA methods have demonstrated to be more sensitive than the conventional functional analysis (ORA or GSEA) and represent a promising alternative for the interpretation of genomic measurements. Actually, the increasing importance of systems medicine to face the challenges of diagnosis and treatment of complex diseases [45] will increase the relevance and make more mainstream the use of approaches such as MPA methods. Key Points MPA methods focus on the activity of pathway substructures defined in different ways. Biologically inspired pathway substructures like receptor-to-effector circuits and methods to score their activity are relevant for the performance of MPA methods. MPA can help to discover actionable mechanistic biomarkers.

Funding

This work was supported by grants SAF2017-88908-R from the Spanish Ministry of Economy and Competitiveness and “Plataforma de Recursos Biomoleculares y Bioinformáticos” PT13/0001/0007 from the ISCIII, both co-funded with European Regional Development Funds (ERDF), and EU H2020-INFRADEV-1-2015-1 ELIXIR-EXCELERATE (ref. 676559).

85 in total

1. Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate.

Authors: Sorin Draghici; Purvesh Khatri; Pratik Bhavsar; Abhik Shah; Stephen A Krawetz; Michael A Tainsky
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

2. Network enrichment analysis in complex experiments.

Authors: Ali Shojaie; George Michailidis
Journal: Stat Appl Genet Mol Biol Date: 2010-05-22

Review 3. Genomics and transcriptomics in drug discovery.

Authors: Joaquin Dopazo
Journal: Drug Discov Today Date: 2013-06-15 Impact factor: 7.851

4. Multidimensional gene set analysis of genomic data.

Authors: David Montaner; Joaquín Dopazo
Journal: PLoS One Date: 2010-04-27 Impact factor: 3.240

5. What would you do if you could sequence everything?

Authors: Avak Kahvejian; John Quackenbush; John F Thompson
Journal: Nat Biotechnol Date: 2008-10 Impact factor: 54.908

6. AlzPathway, an Updated Map of Curated Signaling Pathways: Towards Deciphering Alzheimer's Disease Pathogenesis.

Authors: Soichi Ogishima; Satoshi Mizuno; Masataka Kikuchi; Akinori Miyashita; Ryozo Kuwano; Hiroshi Tanaka; Jun Nakaya
Journal: Methods Mol Biol Date: 2016

7. Sensitive detection of pathway perturbations in cancers.

Authors: Corban G Rivera; Brett M Tyler; T M Murali
Journal: BMC Bioinformatics Date: 2012-03-21 Impact factor: 3.169

8. Empirical comparison of structure-based pathway methods.

Authors: Maria K Jaakkola; Laura L Elo
Journal: Brief Bioinform Date: 2015-07-21 Impact factor: 11.622

9. High throughput estimation of functional cell activities reveals disease mechanisms and predicts relevant clinical outcomes.

Authors: Marta R Hidalgo; Cankut Cubuk; Alicia Amadoz; Francisco Salavert; José Carbonell-Caballero; Joaquin Dopazo
Journal: Oncotarget Date: 2017-01-17

10. Graphite Web: Web tool for gene set analysis exploiting pathway topology.

Authors: Gabriele Sales; Enrica Calura; Paolo Martini; Chiara Romualdi
Journal: Nucleic Acids Res Date: 2013-05-10 Impact factor: 16.971

12 in total

1. A systematic comparison of data- and knowledge-driven approaches to disease subtype discovery.

Authors: Teemu J Rintala; Antonio Federico; Leena Latonen; Dario Greco; Vittorio Fortino
Journal: Brief Bioinform Date: 2021-11-05 Impact factor: 11.622

Review 2. Hybrid modelling of biological systems: current progress and future prospects.

Authors: Fei Liu; Monika Heiner; David Gilbert
Journal: Brief Bioinform Date: 2022-05-13 Impact factor: 13.994

3. Discovering potential interactions between rare diseases and COVID-19 by combining mechanistic models of viral infection with statistical modeling.

Authors: Macarena López-Sánchez; Carlos Loucera; María Peña-Chilet; Joaquín Dopazo
Journal: Hum Mol Genet Date: 2022-06-22 Impact factor: 5.121

4. Common pathways and functional profiles reveal underlying patterns in Breast, Kidney and Lung cancers.

Authors: Sergio Romera-Giner; Zoraida Andreu Martínez; Francisco García-García; Marta R Hidalgo
Journal: Biol Direct Date: 2021-05-26 Impact factor: 4.540

5. Models of cell signaling uncover molecular mechanisms of high-risk neuroblastoma and predict disease outcome.

Authors: Marta R Hidalgo; Alicia Amadoz; Cankut Çubuk; José Carbonell-Caballero; Joaquín Dopazo
Journal: Biol Direct Date: 2018-08-22 Impact factor: 4.540

6. Differential metabolic activity and discovery of therapeutic targets using summarized metabolic pathway models.

Authors: Cankut Çubuk; Marta R Hidalgo; Alicia Amadoz; Kinza Rian; Francisco Salavert; Miguel A Pujana; Francesca Mateo; Carmen Herranz; Jose Carbonell-Caballero; Joaquín Dopazo
Journal: NPJ Syst Biol Appl Date: 2019-03-01

7. Mechanistic Models of Signaling Pathways Reveal the Drug Action Mechanisms behind Gender-Specific Gene Expression for Cancer Treatments.

Authors: Cankut Çubuk; Fatma E Can; María Peña-Chilet; Joaquín Dopazo
Journal: Cells Date: 2020-06-29 Impact factor: 6.600

8. Mechanistic modeling of the SARS-CoV-2 disease map.

Authors: Kinza Rian; Marina Esteban-Medina; Marta R Hidalgo; Cankut Çubuk; Matias M Falco; Carlos Loucera; Devrim Gunyel; Marek Ostaszewski; María Peña-Chilet; Joaquín Dopazo
Journal: BioData Min Date: 2021-01-21 Impact factor: 2.522

9. A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways.

Authors: Martín Garrido-Rodriguez; Daniel Lopez-Lopez; Francisco M Ortuno; María Peña-Chilet; Eduardo Muñoz; Marco A Calzado; Joaquin Dopazo
Journal: PLoS Comput Biol Date: 2021-02-11 Impact factor: 4.475

10. Using mechanistic models for the clinical interpretation of complex genomic variation.

Authors: María Peña-Chilet; Marina Esteban-Medina; Matias M Falco; Kinza Rian; Marta R Hidalgo; Carlos Loucera; Joaquín Dopazo
Journal: Sci Rep Date: 2019-12-12 Impact factor: 4.379