Mathias Cardner1,2, Nathalie Meyer-Schaller3, Gerhard Christofori3, Niko Beerenwinkel1,2. 1. Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland. 2. SIB Swiss Institute of Bioinformatics, Basel, Switzerland. 3. Department of Biomedicine, University of Basel, Basel, Switzerland.
Abstract
MOTIVATION: In order to infer a cell signalling network, we generally need interventional data from perturbation experiments. If the perturbation experiments are time-resolved, then signal progression through the network can be inferred. However, such designs are infeasible for large signalling networks, where it is more common to have steady-state perturbation data on the one hand, and a non-interventional time series on the other. Such was the design in a recent experiment investigating the coordination of epithelial-mesenchymal transition (EMT) in murine mammary gland cells. We aimed to infer the underlying signalling network of transcription factors and microRNAs coordinating EMT, as well as the signal progression during EMT. RESULTS: In the context of nested effects models, we developed a method for integrating perturbation data with a non-interventional time series. We applied the model to RNA sequencing data obtained from an EMT experiment. Part of the network inferred from RNA interference was validated experimentally using luciferase reporter assays. Our model extension is formulated as an integer linear programme, which can be solved efficiently using heuristic algorithms. This extension allowed us to infer the signal progression through the network during an EMT time course, and thereby assess when each regulator is necessary for EMT to advance. AVAILABILITY AND IMPLEMENTATION: R package at https://github.com/cbg-ethz/timeseriesNEM. The RNA sequencing data and microscopy images can be explored through a Shiny app at https://emt.bsse.ethz.ch. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: In order to infer a cell signalling network, we generally need interventional data from perturbation experiments. If the perturbation experiments are time-resolved, then signal progression through the network can be inferred. However, such designs are infeasible for large signalling networks, where it is more common to have steady-state perturbation data on the one hand, and a non-interventional time series on the other. Such was the design in a recent experiment investigating the coordination of epithelial-mesenchymal transition (EMT) in murine mammary gland cells. We aimed to infer the underlying signalling network of transcription factors and microRNAs coordinating EMT, as well as the signal progression during EMT. RESULTS: In the context of nested effects models, we developed a method for integrating perturbation data with a non-interventional time series. We applied the model to RNA sequencing data obtained from an EMT experiment. Part of the network inferred from RNA interference was validated experimentally using luciferase reporter assays. Our model extension is formulated as an integer linear programme, which can be solved efficiently using heuristic algorithms. This extension allowed us to infer the signal progression through the network during an EMT time course, and thereby assess when each regulator is necessary for EMT to advance. AVAILABILITY AND IMPLEMENTATION: R package at https://github.com/cbg-ethz/timeseriesNEM. The RNA sequencing data and microscopy images can be explored through a Shiny app at https://emt.bsse.ethz.ch. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
In order to learn biological networks, gene silencing techniques such as RNA interference provide a means to infer causal interactions. This is often coupled with high-throughput sequencing technologies such as RNA sequencing to measure the mRNA expression of genes in perturbed cells. In the literature, methods for learning regulatory networks from gene expression data tend to rely on the assumption that the expression level of a given gene is a reliable proxy for the activity of the protein which it encodes (Friedman, 2004; Pe’er ; Segal ). The same holds true for more recent approaches incorporating perturbation data (Peters ) and time series (Chen ). This is appropriate in transcriptionally regulated networks, where changes in gene expression account for all causal interactions. However, in networks consisting of transcription factors and microRNAs, post-transcriptional regulation and post-translational modifications play a major role in signalling. In such a case of non-transcriptional regulation, we can model the regulators as hidden variables, and attempt to infer their network structure through their perturbation effects on the transcriptome. The nested effects model (NEM) (Markowetz ) was designed for this very purpose.In the original NEM framework (Markowetz ), the signalling network is considered static, and samples are assumed to be measured at a time when the perturbed system has reached a steady state. The dynamic NEM (Anchang ) is an attempt to include time series from perturbation experiments in the NEM framework. However, this method still relies on the steady state network as a basis for inference, and the main benefit is the ability to prune transitively closed edges from the static network. A more comprehensive treatment of dynamic NEM (Fröhlich ) unrolls the NEM over time—similar to the idea of a dynamic Bayesian network—which can resolve feedback loops in the network. The inferred network is still static, but the time required for propagation of signals is allowed to vary. In contrast, the hidden Markov NEM (Wang ) allows for the network structure itself to change over time.All the above approaches require time series from perturbation experiments. In other words, the perturbed system must be measured at several time points until it reaches a steady state. While this experimental setup is possible for smaller studies containing few regulators, it becomes infeasible for larger networks. In the latter case, it is more common to have steady state data from perturbation experiments on the one hand, and a time series from an unperturbed system on the other. In such a setting we cannot reliably infer a dynamic network, nor resolve feedback loops in a static network. Instead, we propose to use the perturbation data to infer a steady state network, and then map the time series onto that network in order to assess the progression to the steady state. Given the differential expression profile at time point t versus the steady state, we try to find the most likely signalling state of the static network to yield such an effect profile.Our method is motivated by a recent experiment investigating epithelial–mesenchymal transition (EMT) in murine mammary gland cells (Meyer-Schaller ). In cancer, EMT plays an important role in the development of metastasis, which is the major cause of death for patients with cancer (Chaffer ). Cells in the epithelial state tend to be stationary, whereas cells in the mesenchymal state are migratory and invasive, which enables cancer cells to enter the blood stream and to seed distant metastases. On the other hand, the reverse transition from a mesenchymal to an epithelial state (MET) is believed to boost the outgrowth of seeded metastatic cancer cells in distant organs (Brabletz, 2012). However, recent findings indicate that full EMT and MET processes may not be required for metastasis, but rather that transitions to intermediate EMT stages could be sufficient (Pastushenko ).EMT entails major changes in the transcriptional and epigenetic landscape of a cell, and multiple factors regulating this transition have been identified. In particular, the interplay between Snail and Zeb transcription factors with microRNAs miR-34 and miR-200 is believed to regulate the acquisition of intermediate EMT states (Nieto ). A recent screening of >1500 gene regulators in normal murine mammary gland (NMuMG) cells led to the identification of 46 transcription factors and 13 microRNAs which were necessary for EMT to occur (Meyer-Schaller ). Here, we aimed at inferring a signalling network of those 59 regulators and estimating the signal progression through the network during the transition from the epithelial to the mesenchymal state.Treating epithelial cells with transforming growth factor beta (TGF-β) triggers a signalling cascade which causes the cells to transition into a mesenchymal state (Fig. 1a). If any of the crucial 59 transcription factors and microRNAs is perturbed before and during the TGF-β treatment, then the cells do not reach the mesenchymal state, but are instead halted at an intermediate stage (Fig. 1b). We hypothesized that perturbing a relevant transcription factor or microRNA would leave an informative record on the transcriptome. That is, when comparing the transcriptome measured in perturbation experiments versus a control experiment (all sampled 96 h after TGF-β treatment), regulators which are crucial early in the signalling cascade should show more extensive differential expression than regulators which play a role late in the cascade. Furthermore, we would expect to find a subset structure in the perturbation effects, due to signal propagation in the underlying pathway (Markowetz ). We used a NEM (Fröhlich ; Markowetz ) to infer a signalling network from downstream perturbation effects based on RNA sequencing data from RNA interference experiments sampled after 96 h of TGF-β treatment (Fig. 1c). Following the hypothesis above, we expect that genes directly regulated by a transcription factor or microRNA crucial at for instance 24 h, would show similar differential expression between perturbation and control (both sampled at 96 h), as they do in an unperturbed time course when comparing 24 h with 96 h. Based on this assumption, we developed a method for inferring signal progression through the pathway, in order to infer the time point at which each regulator is necessary for EMT to progress (Fig. 1d). Our model (Section 2.3) thus integrates a non-interventional time series with perturbation data, and is formulated as an integer linear programme which can be solved efficiently using heuristic algorithms.
Fig. 1.
Synopsis. (a) Normal cells undergo EMT induced by TGF-β. (b) EMT is halted in cells perturbed by RNAi against any of the 59 regulators under investigation. Despite TGF-β treatment to induce EMT, the perturbed cells do not reach the mesenchymal phenotype, but are halted at intermediate stages. (c) Nested effects model to infer a non-transcriptional signalling network from downstream perturbation effects on genes measured using RNA sequencing. (d) Inferring the signal progression during an unperturbed EMT time course
Synopsis. (a) Normal cells undergo EMT induced by TGF-β. (b) EMT is halted in cells perturbed by RNAi against any of the 59 regulators under investigation. Despite TGF-β treatment to induce EMT, the perturbed cells do not reach the mesenchymal phenotype, but are halted at intermediate stages. (c) Nested effects model to infer a non-transcriptional signalling network from downstream perturbation effects on genes measured using RNA sequencing. (d) Inferring the signal progression during an unperturbed EMT time course
2 Materials and methods
We analysed data gathered in the experiments described by Meyer-Schaller . In short, transforming growth factor beta (TGF-β) was used to induce EMT in epithelial NMuMG cells, but RNA interference (RNAi) against certain regulators prevented the cells from reaching the mesenchymal state. A screening of approximately 1500 RNAi interventions identified 46 transcription factors (TFs) and 13 microRNAs (miRNAs) which were necessary for EMT to occur. Our aim was to infer the signalling network of those 59 regulators. Although direct regulation between TFs themselves and between TFs and miRNAs can occur, the signalling activity of the 59 regulators can also be regulated on post-transcriptional level, for instance by the miRNAs interfering with protein translation (Wang ). Moreover, TF activities can be affected indirectly through post-translational modification and nuclear localization. Therefore, modelling interactions between multiple TFs and miRNAs is complex and poorly understood (Le ).We obtained RNA sequencing data from two separate experimental set-ups on epithelial normal murine mammary gland (NMuMG) cells; namely one EMT time course and one set of 59 RNAi perturbation experiments. For the time course, epithelial NMuMG cells were treated with TGF-β and subsequently sampled at 12 time points between 0 and 10 days. Already after 4 days, the cells exhibited a mesenchymal phenotype (Fig. 1a). The experiment was performed in biological duplicates.For each of the 46 transcription factors, Ambion siRNAs were used to knock down the transcript coding for the given TF. For each of the 13 miRNAs, RNAi amounted to over-expression of the given miRNA. RNAi was performed against the regulators of interest, starting 2 days before TGF-β treatment, so that the RNAi would have time to take effect before EMT was induced. After 4 days of TGF-β treatment, the cells were sampled and sequenced. At this time point, the cells had been halted in an intermediate stage between epithelial and mesenchymal states (Fig. 1b). In addition, control samples with inert (non-targeting) siRNA or miRNA transfections were sampled after 0 and 4 days of TGF-β treatment. All experiments were performed in biological duplicates.
2.1 The standard NEM
We here outline the original nested effect model (Markowetz , 2007), in order to introduce concepts and notation required to explain how we integrate perturbation data with a non-interventional time series (Section 2.3).Figure 2 illustrates the structure of a NEM. The goal is to infer a signalling network Φ, represented as a directed acyclic graph involving the regulators, also referred to as signalling nodes (S-nodes), denoted S for . The classic NEM framework uses the term ‘S-genes’, which can be misleading, since we here refer to gene products, such as transcription factors. Furthermore, in our case 13 of the regulators are miRNAs, which is why we use the term ‘S-nodes’. The signalling nodes are individually perturbed in separate RNAi experiments, but their activity is not directly measured. Instead, we measure the expression level of effect reporter genes (E-genes) denoted E for , which are affected by the perturbation of S-nodes. The attachment of effect genes to regulators is encoded by Θ, and the network reconstruction is based on an assumption of parsimony. Namely, if an E-gene responds to the perturbation of two S-nodes, we assume that it is directly regulated by only one of them, which acts as a mediator for the other regulator. For instance, in Figure 2, S1 and S2 both regulate {E1, E2}. However, since the effects of perturbing S1 are a subset of the effects of perturbing S2, we deduce that S1 is downstream of S2 in the signalling pathway, and we infer the edge . Thus, the regulators are hierarchically ordered so that the effects of a child node are a subset of the effects of its parent. Due to this nesting of effects, we implicitly assume that the signalling network Φ is transitively closed; that is, if , then . In practice, we need to account for noise in the data, which requires a probabilistic model.
Fig. 2.
Nested effects model. The S-nodes are modelled as hidden variables, and we aim to infer their causal graph Φ (solid arrows). In experiments separately perturbing each S-node, we observe the differential expression of all E-genes. Assuming that each E-gene is directly regulated by at most one S-node in Φ, we compute the maximum a posteriori attachment Θ (dashed arrows) of effect genes to S-nodes. We search for the signalling graph Φ which yields the most likely probabilistic nesting of effects
Nested effects model. The S-nodes are modelled as hidden variables, and we aim to infer their causal graph Φ (solid arrows). In experiments separately perturbing each S-node, we observe the differential expression of all E-genes. Assuming that each E-gene is directly regulated by at most one S-node in Φ, we compute the maximum a posteriori attachment Θ (dashed arrows) of effect genes to S-nodes. We search for the signalling graph Φ which yields the most likely probabilistic nesting of effectsLet D be the data matrix such that D records how E-gene E was affected by the perturbation of S-node S. (More details are given in Section 2.2) Suppose that we have a candidate network Φ, which is a directed acyclic graph (DAG) of S-nodes . We are ultimately interested in the posterior probability of the model:
where the denominator does not depend on Φ, and can be disregarded for model comparison. Since almost nothing is known about the signalling network under our investigation, we use a uniform prior P(Φ). Thus we focus entirely on the likelihood . It can be computed by marginalizing over E-gene attachments Θ, or by using the maximum a posteriori (MAP) estimate of Θ (Tresch and Markowetz, 2008). We will follow the latter approach, and thus solve the following:The outer maximization is done by evaluating candidate DAGs in a heuristic way (see Section 2.2). The inner maximization over Θ is performed as follows.Let denote the attachment of E-genes to the signalling nodes, such that θ = j if E is attached to S. We assume that the attachments of E-genes are conditionally independent given the signalling network Φ, so that . Furthermore, we assume that are conditionally independent given Φ and Θ. Thus, we do not assume genes to be marginally independent, but rather conditionally independent given their attachments to the regulators. Hence,The full derivations are provided in the Supplementary Material.
2.2 Scoring and searching
In order to judge from noisy observations whether a certain E-gene is affected by the perturbation of a given S-node, we used the R (R Core Team, 2018) package DESeq2 (Love ) to perform differential expression analysis of RNA sequencing data. Each perturbed condition was compared with an unperturbed control, yielding a P-value for each gene, with the null hypothesis of no differential expression. The data matrix entry D is the P-value corresponding to gene E under perturbation of S. The alternative hypothesis of differential expression is central to our analysis. We therefore fit a beta-uniform mixture model to the distribution of P-values across genes in a given condition (Pounds and Morris, 2003). Since the null distribution of a P-value is uniform, the density function of the alternative hypothesis can be estimated from the fitted mixture components (Fröhlich ); see details in Supplementary Material.For each perturbation experiment k, we estimated the probability density function f of P-values under the alternative hypothesis, such that:
where denotes the adjacency matrix of the candidate network Φ. In words, if Φ predicts an effect on E when perturbing S (given θ = j), we model the P-value D using the alternative density f; otherwise we model it with the uniform density 1. With this definition, we can evaluate Equation (1).Since the number of DAGs grows super-exponentially with the number of nodes, an exhaustive enumeration is not feasible for signalling networks with more than four or five S-nodes. In order to infer larger networks, we must apply heuristic approaches. Specifically, we used a divide-and-conquer method called module network (Fröhlich ), which is tailored to NEMs. It exploits the fact that the NEM objective is to find a subset structure among effect genes. Thus, the module network approach subdivides the problem by initially performing hierarchical clustering based on correlation distance of effect profiles, in order to group signalling nodes into modules containing at most four S-nodes. Then, an exhaustive search is performed within each module. Finally, modules are joined by the highest scoring pairwise connections between modules.
2.3 Mapping a time series onto a static network
Suppose that we have inferred a signalling network from a perturbed system sampled upon reaching a steady state at time t after applying a pathway stimulus. Now we aim to estimate the signal progression at a time point t < t from an unperturbed time series sampled after application of the same stimulus. We can compute the maximum a posteriori attachment of E-genes to S-nodes in (see details in Supplementary Material). The regulatory module of a signalling node S is the set of E-genes whose MAP attachment is to S, that is the set . Using the unperturbed time series, we consider differential expression of E-genes in a given regulatory module as a proxy for the activity of its direct regulator.For a given time point , we perform a differential gene expression analysis contrasted with the steady-state time point t. Let the P-value profile of the differential expression tests at time point t versus t be denoted by , where m is the number of E-genes. Thus is the P-value of the ith gene from the differential expression test contrasting t with the steady state t. Based on this, we want to estimate the most likely state of the signalling network at time t. In other words, we are asking what signalling activities (perturbed or not) of the S-nodes in are most likely giving rise to the differential expression profile which we observe. This implicitly rests on the assumption that the n S-nodes in sufficiently explain the observed changes to the transcriptome, and that there is no substantive confounding by latent variables.Let record the binary signalling states of the n regulators, where K = 1 if S is perturbed and K = 0 otherwise. Given the above assumption, gene E is not differentially expressed if and only if pa(E) is not perturbed under K. The former is the null hypothesis of the differential expression test, and the latter is a statement about the activity of S-nodes in . In keeping with the transitive closure and propagation of signals in , gene E is expected to be differentially expressed if and only if any of the ancestors of E in —denoted anc(E)—is perturbed under K. Thus, based on the same reasoning as in Equation (2),
where f is the alternative density of the differential expression test comparing time point t with the steady-state time point t. Again assuming conditional independence of E-genes given their attachment to the signalling network , we have,
which we seek to maximize with respect to K. This is equivalent to solving:Due to the transitively closed nature of , we need only search for solutions K which are consistent with , in the sense that if K = 1, then K = 1 for every node S downstream of S in . Given this constraint, the sum in Equation (5) can be written as follows:With the problem in Equation (5) is therefore equivalent to maximizing given the constraint that if K = 1, then K = 1 for every descendant S of S in . This constraint can be formulated as for a matrix A which we construct as follows. For each edge in the transitive reduction of , we create a row such that A = 1 and , with for . The optimization problem in Equation (5) can thus be formulated as a binary integer linear programme with canonical form:where K is a binary vector. We used the R package lpSolve (Berkelaar ) to solve this problem.
2.4 Validation experiments
Luciferase reporter assays were used to assess the accuracy of the static NEM signalling network edge predictions in NMuMG epithelial cells which underwent EMT upon the addition of TGF-β. The assays were performed as previously reported by Meyer-Schaller . In short, RNAi was used to down-regulate a transcription (co)factor by siRNA, or to over-express a miRNAs. One day later, TGF-β was added to the cells to induce EMT, followed by retransfection (day 2) and transfection of a Firefly luciferase reporter 2 days after EMT induction (day 3) together with a Renilla luciferase reporter to normalize for cell counts. After 4 days of TGF-β treatment (day 5), cells were assayed for luciferase activity. To measure Smad4 or Sox4 activities, Firefly luciferase reporters with Smad3/4 binding sequence (Dennler ) or seven specific Sox4 binding sites (Meyer-Schaller ) were used, respectively. The activity of the Notch co-activator Maml2 was assessed with a Notch reporter containing specific Rbpjκ response elements (Promega, CS173601).
3 Results
We used the Bioconductor R package nem (Froehlich ) to infer a static signalling network from the RNAi perturbation data. In order to base the analysis on relevant genes, we included only genes which were differentially expressed (FDR < 5%) during the time course, provided that they were also differentially expressed in at least one perturbation experiment. This left us with 11 270 E-genes. Furthermore, the nem fitting procedure included a so-called null node, to which uninformative genes were attached. In order to assess the robustness of the inferred network, we performed bootstrap with 1000 resamples, each time resampling all E-genes with replacement, and inferring a network. Additionally, we ran a jackknife procedure, leaving out one signalling node at a time, and inferring the remaining network containing 58 regulators. We summarize the results in Figure 3, which aggregates the transitively closed adjacency matrices of the inferred networks.
Fig. 3.
Adjacency matrix showing the average of bootstrap and jackknife supports for each edge in the graph. Parent nodes are topologically ordered (from root to leaf), and child nodes follow the same ordering. The entry in row i and column j records the average of the bootstrap and jackknife supports for the directed edge
Adjacency matrix showing the average of bootstrap and jackknife supports for each edge in the graph. Parent nodes are topologically ordered (from root to leaf), and child nodes follow the same ordering. The entry in row i and column j records the average of the bootstrap and jackknife supports for the directed edgeFor each edge, its bootstrap support is the proportion of inferred networks (out of 1000) in which the given edge is present. Analogously, the jackknife support of a given edge is the proportion of networks (out of 57) in which that particular edge is present. (The proportion is out of 57 rather than 59, since for each edge between a pair of nodes, there were precisely two jackknife runs where either of two nodes was left out.) Each cell in the adjacency heatmap corresponds to the average of the bootstrap and jackknife supports for that given edge. We used the average because we wanted both procedures to contribute equally to the consensus, and the bootstrap would otherwise dominate by virtue of having 1000 runs, whereas the jackknife is restricted by the number of S-nodes. Figure 4 shows the consensus network consisting of the edges whose average support exceeds 50%. To aid visualization, the transitive reduction of the graph is displayed, meaning that we omit redundant edges implied by transitively closing the graph.
Fig. 4.
Transitive reduction of the consensus network, which contains edges whose average of bootstrap and jackknife supports exceeds 50%. Nodes are topologically ordered so that all edges point downward. Transcription factors are coloured blue and miRNAs orange
Transitive reduction of the consensus network, which contains edges whose average of bootstrap and jackknife supports exceeds 50%. Nodes are topologically ordered so that all edges point downward. Transcription factors are coloured blue and miRNAs orange
3.1 Validation of the static network
The transcription factors whose activity was measured by reporter assays were selected from the top (Sox4), middle (Smad4) and bottom (Maml2 via Rbpjκ) of the inferred signalling hierarchy (Fig. 4). For each reporter assay, RNAi perturbation experiments were performed against regulators selected uniformly from the hierarchy. Each such experiment assessed the directed edge from the perturbed regulator (parent node) to the transcription factors measured by the reporter assay (child node). By comparing the validated effects (Fig. 5) with the predicted edges of the NEM (Fig. 3), we estimated precision and recall of the inferred signalling graph. In total, 53 NEM edge predictions were experimentally assessed. For the Smad4 and Sox4 reporter assays, the precision and recall is 75% and 62.5%, respectively. If including the Rbpjκ reporter as a proxy for Maml2 activity, the overall precision and recall is 67.6% and 73.5%, respectively.
Fig. 5.
Summary of luciferase reporter assays for Rbpj, Smad4 and Sox4. The volcano plots are based on Welch’s t-tests of perturbation versus control, after removing batch effects using a linear model. Experiments above the dotted lines have FDR-adjusted P-values <5%, and are labelled. Predictions by the NEM consensus network (Fig. 4) are indicated by green triangles for a predicted effect, and orange circles for a prediction of no effect. Part of this validation dataset was previously published by Meyer-Schaller
Summary of luciferase reporter assays for Rbpj, Smad4 and Sox4. The volcano plots are based on Welch’s t-tests of perturbation versus control, after removing batch effects using a linear model. Experiments above the dotted lines have FDR-adjusted P-values <5%, and are labelled. Predictions by the NEM consensus network (Fig. 4) are indicated by green triangles for a predicted effect, and orange circles for a prediction of no effect. Part of this validation dataset was previously published by Meyer-Schaller
3.2 Signal progression during EMT
Principal component analysis of the time-course samples shows that the gradual progression of EMT during the time course was captured on a transcriptomic level (Fig. 6). The first principal component accounts for 68% of the variance in the data, and almost perfectly orders the samples chronologically. Exceptions are the samples taken at 60 h, which have scores on PC1 most similar to the scores of the samples taken at 36 h. The 60 h samples also have extreme scores with respect to PC3.
Fig. 6.
Principal component analysis based on expression of 24 454 genes in biologically duplicated samples taken during an EMT time course. Top panel: PC1 versus PC2, bottom: PC2 versus PC3. The data was normalized using a regularized logarithmic transform (Love ), and batch effects were removed using a linear model (Ritchie )
Principal component analysis based on expression of 24 454 genes in biologically duplicated samples taken during an EMT time course. Top panel: PC1 versus PC2, bottom: PC2 versus PC3. The data was normalized using a regularized logarithmic transform (Love ), and batch effects were removed using a linear model (Ritchie )We used our method described in Section 2.3 to map the time-course data onto the inferred network. For a given time point , we computed the differential expression profile between t and 96 h, which represents the mesenchymal stage. Then, we estimated which states of the S-nodes are most likely to produce the differential expression profile which we observed. The procedure was performed on the inferred network and its MAP E-gene attachment for every bootstrap and jackknife resample. For each t, this yielded 1000 + 59 estimates of K, which were aggregated by taking the component-wise (node-wise) average. The result is shown in Figure 7, where an S-node is uncoloured if it is inferred to be in a signalling state different to that at 96 h. Otherwise it is coloured green. Thus the colour of a given S-node indicates the time point at which its signalling activity is inferred to differ or agree with that at the mesenchymal state. The colour gradient was determined by running the integer linear programme on all networks inferred across bootstrap and jackknife resamples, followed by aggregating the results.
Fig. 7.
Inferred signal progression at hours after TGF-β treatment. An S-node is uncoloured if it is inferred to be in a different signalling state at t versus 96 h; otherwise it is coloured green. The colour gradient comes from aggregating the inferred activity states over the networks resulting from 1000 bootstrap and 59 jackknife resamples. For illustration purposes, the signal progression is mapped onto the consensus network (Fig. 4)
Inferred signal progression at hours after TGF-β treatment. An S-node is uncoloured if it is inferred to be in a different signalling state at t versus 96 h; otherwise it is coloured green. The colour gradient comes from aggregating the inferred activity states over the networks resulting from 1000 bootstrap and 59 jackknife resamples. For illustration purposes, the signal progression is mapped onto the consensus network (Fig. 4)The progression to mesenchymal signalling activity starts with the S-nodes at the top of the hierarchy, as implied by enforcing the transitive closure in the integer linear programme. However, the same trend is observed even when not requiring the mapping to respect the transitive closure see Supplementary Material (Supplementary Fig. S1). As EMT progresses, the signal flows slowly from top to bottom of the network, except for the 60 h time point, and reaches the tail of the network by 72 h of TGF-β treatment. These results indicate that the NEM, although based on perturbation data from only one time point, can capture the temporal progression of signalling activities through the EMT system. Supplementary Figure S2 shows that signal progression would not be inferred by random chance, assessed by permuting gene labels before estimating signalling activity.
3.3 Computational complexity
The computational complexity of the module network algorithm for standard nested effect models has been addressed in previous articles (Fröhlich ), so we will not study it here. In our case of inferring a NEM with 59 S-nodes and 11 270 E-genes, the average run-time was 12.4 CPU hours (SD 1.72) on an IntelⓇ XeonⓇ Processor E5-2680 v3 (2.5–3.3 GHz). The average memory requirement was 36.3 GB (SD 3.37). In contrast, our method for mapping the EMT time-course data onto the inferred NEM takes less than a second. Thus the bottleneck is clearly the network inference.In order to assess the computational complexity of our proposed method for inferring signalling activity in a NEM, we simulated synthetic NEMs with n S-nodes and m E-genes, where values of n and m were varied systematically. For each synthetic network, we randomly perturbed S-nodes, and generated synthetic E-gene data from a beta-uniform mixture model. Then, we ran our algorithm to map the effect profile onto the NEM, and recorded the elapsed run-time. To assess the complexity dependence on n, we fixed m = 2000 and varied . Each configuration was run 100 times each, and the resulting box plots are shown in the left panel of Figure 8. Since the run-time is plotted on a logarithmic scale, the linear trend (regression slope 0.042) suggests that the time complexity grows exponentially with the number of S-nodes. To estimate the complexity as a function of m, we fixed n = 60 and varied . Each configuration was repeated 100 times, and the resulting box plots are shown in the right panel of Figure 8. Again the trend is roughly linear on the logarithmic scale (regression slope 5.8×10−5), though not as steep as that for n. Thus we see a time complexity which grows exponentially with the number of E-genes.
Fig. 8.
Computational complexity of our method for mapping an observational effect profile onto a static NEM. The CPU time in milliseconds is shown on a logarithmic scale on the y-axis. The number of S-nodes n varied from 10, 20, …, 100 (left panel). The number of E-genes m varied from 2000, 4000, …, 20 000 (right panel). Simulations were timed using the R package microbenchmark (Mersmann, 2018)
Computational complexity of our method for mapping an observational effect profile onto a static NEM. The CPU time in milliseconds is shown on a logarithmic scale on the y-axis. The number of S-nodes n varied from 10, 20, …, 100 (left panel). The number of E-genes m varied from 2000, 4000, …, 20 000 (right panel). Simulations were timed using the R package microbenchmark (Mersmann, 2018)
4 Discussion
We extended the NEM framework to map an unperturbed EMT time series onto a steady state signalling network inferred from RNAi perturbation experiments. Based on the static network, our method estimates the progression of signalling activity of 59 transcription factors and miRNAs during EMT. The methodology integrates an unperturbed time series with perturbation data, which opens up new avenues in modelling signalling activities on large datasets where time-resolved perturbation experiments are infeasible.EMT enables epithelial cells to become migratory and invasive, and requires a complex and coordinated flow of transcription (co)factor activities to reach a mesenchymal state. In the past, the acquisition of different EMT stages has been modelled using three- or four-factor systems, or based on activity predictions in the absence of perturbation data (Hong ; Steinway ; Zhang ). In contrast, in this study we made use of a large set of transcriptomic data from RNAi experiments of 59 transcription factors and miRNAs, which have been experimentally reported as essential EMT factors (Meyer-Schaller ). Using a NEM, we inferred a signalling network from the RNAi perturbation experiments. Extending the NEM framework, we mapped an unperturbed EMT time series onto the inferred network, in order to gauge the signal progression through the network during EMT. To validate the NEM network, we performed transcription factor reporter assays to measure TF signalling activities while perturbing upstream factors. These experiments yielded precision and recall estimates of 75% and 62.5%, respectively. Nevertheless, we were surprised that some important EMT factors, such as Smad4 or Zeb1 and Zeb2 (Lamouille ), appear rather low in the inferred signalling hierarchy, in the middle to the beginning of the ‘tail’ in Figure 4. The kinetic prediction of the signalling flow (Fig. 7) confirms an early activation of the top nodes in the network (up to 48 h TGF-β treatment) followed by a rapid activation of the bulk of the signalling network (72 h TGF-β treatment). Interestingly, the network flow reverts at the 60 h time point, coinciding with a similar divergence in the principal component analysis (Fig. 6), which otherwise captures the chronological EMT progression very accurately. TGF-β treatment of NMuMG cells results in TGF-β-induced cell-cycle arrest and induction of apoptosis before the cells convert to a mesenchymal state (Massagué, 2012). The partial reversion of TGF-β signalling after 60 h of treatment could reflect an important return point for the cells to overcome the apoptotic stimuli before continuing on the mesenchymal transition route.Zeb transcription factors are engaged in a well characterized feedback loop with the miR-200 family (Wellner ). Since the NEM inference is constrained to acyclic graphs, we cannot infer feedback loops. However, the NEM infers directed edges into Zeb2 from the miR-200 family members miR-200a, miR-200b and miR-429, as well as an edge from Zeb1 to Zeb2, and the acyclic constraint forces one to be inferred upstream of the other. This may explain why Zeb1 is not inferred to be downstream of the miR-200 family, since this would otherwise create a feedback loop between Zeb1 and Zeb2. Due to the lack of time resolution in the perturbation data, feedback loops cannot be resolved, and we must consider the topology of the inferred network as a snapshot of the pathway at the time of sampling.
4.1 Limitations and possible extensions of the methodology
The NEM framework and our extension of it (Section 2.3) both make strong assumptions on the structure of the data generating mechanism. As discussed in the introduction, those assumptions are made because we have data only on the transcriptomic effects of perturbing signalling nodes, whereas we lack data on the activity of the signalling nodes themselves. However, assuming nested effects, we can infer signalling activities of the network nodes without information of whether this signalling is direct or mediated by unknown factors. The alternative—to model signalling nodes by the abundances of the mRNAs coding for them—would only work in systems that are transcriptionally regulated, that is, where regulators directly bind to each other’s promoter regions. Admittedly, the model assumption of nested effects may be too strict for our application since it does not accommodate feedback loops, which have been reported to exist between Zeb and the miR-200 families (Nieto ). However, feedback loops are notoriously difficult to model in structure learning, with only a few exceptions where the form of the feedback is described by linear structural equation models (Drton and Maathuis, 2017). This class of models is not applicable in our case, since the variables of interest are hidden.An alternative approach to de novo network inference is to utilize prior knowledge of protein–DNA and protein–protein interactions (PPIs) (Yeang ). We did not pursue this approach for our EMT application in murine cells, because the current knowledge is too scarce. However, methods based on PPI networks have been developed with yeast as the model organism. For instance, Ourfali utilize a directed PPI network to search for signalling pathways connecting a perturbed gene with an affected target. To utilize undirected PPI networks whose edges are weighted by confidence, Gitter turn to the problem of orienting edges with the objective of maximizing the confidence in source-target paths of a bounded length. This method was also applied to incorporate time series (Gitter ). In a similar vein, Silverbush and Sharan (2014) aim to orient edges in an undirected PPI network while maximizing the number of directed source-target vertex pairs which are connected via shortest paths in the original graph. The above approaches all solve integer linear programmes (ILPs), though their objective is rather different from ours. Whereas those approaches solve ILPs to find paths or orient edges, our ILP takes the network inferred by the NEM as given, and instead aims to detect differential signalling based on a probabilistic measure of differential expression within regulatory modules.In our analysis, we did not take into account the direction of differential expression; that is, whether the log-fold change was positive or negative. Similarly, we did not quantify whether relationships in the signalling network were activating or inhibiting. Rather, we considered the statistical significance of differential expression and the presence of interactions between regulators, regardless of their sign. This is in line with the standard NEM framework, but there exists an extension called Factor Graph NEM (Vaske ) which natively models not only the presence of effects, but also their sign. This extension reformulates the NEM as a factor graph, and uses message passing to learn the optimal network. The authors reported that their extension performed better than the standard NEM in the presence of inhibitions in both the signalling graph and the effects. Since miRNAs tend to silence gene expression, and transcription factors can both activate and inhibit, it is possible that FG-NEM would yield different results for our data. However, in the absence of inhibition, the authors reported that FG-NEM, surprisingly, performs dramatically worse than the standard NEM. Furthermore, FG-NEM is designed only for knock-down experiments, and would therefore require some modification to appropriately model the over-expression of miRNAs, as were performed in our case.Since the number of directed acyclic graphs grows super-exponentially with the number of nodes, the computational bottleneck is to infer the static NEM, which is a prerequisite for our method. A simulation study showed that the computational complexity of our method for mapping a time series onto a signalling network is negligible compared to inferring the static NEM.Click here for additional data file.