Alex Stivala, Alessandro Lomi.
Abstract
Analysis of the structure of biological networks often uses statistical tests to establish the over-representation of motifs, which are thought to be important building blocks of such networks, related to their biological functions. However, there is disagreement as to the statistical significance of these motifs, and there are potential problems with standard methods for estimating this significance. Exponential random graph models (ERGMs) are a class of statistical model that can overcome some of the shortcomings of commonly used methods for testing the statistical significance of motifs. ERGMs were first introduced into the bioinformatics literature over 10 years ago but have had limited application to biological networks, possibly due to the practical difficulty of estimating model parameters. Advances in estimation algorithms now afford analysis of much larger networks in practical time. We illustrate the application of ERGMs to both an undirected protein-protein interaction (PPI) network and directed gene regulatory networks. The ERGMs indicate over-representation of triangles in the PPI network, and confirm results from previous research as to over-representation of transitive triangles (feed-forward loops) in an E. coli and a yeast regulatory network. We also confirm, using ERGMs, previous research showing that under-representation of the cyclic triangle (feedback loop) can be explained as a consequence of other topological features.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s41109-021-00434-y.
Keywords: Biological network; ERGM; Exponential random graph model; Motif
Year: 2021 PMID: 34841042 PMCID: PMC8608783 DOI: 10.1007/s41109-021-00434-y
Source DB: PubMed Journal: Appl Netw Sci ISSN: 2364-8228
Fig. 1 Triad census classes labeled with the MAN (mutual, asymmetric, null) dyad census naming convention. When the dyad census does not uniquely identify a triad, a letter designating “up”, “down”, “transitive”, or “cyclic” is appended
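The MAN naming convention can be illustrated with a minimal Python sketch. It assumes a network stored as a set of `(source, target)` arc tuples with no self-loops; the function names `man_code` and `label_030` are hypothetical and not taken from the paper. The sketch counts the mutual, asymmetric, and null dyads in a node triple and, for the 030 case only, appends the "transitive" or "cyclic" letter.

```python
from itertools import combinations

def man_code(arcs, triple):
    """Return the three MAN digits (mutual, asymmetric, null) for a node triple."""
    m = a = n = 0
    for u, v in combinations(triple, 2):
        uv, vu = (u, v) in arcs, (v, u) in arcs
        if uv and vu:
            m += 1          # arcs in both directions: mutual dyad
        elif uv or vu:
            a += 1          # arc in exactly one direction: asymmetric dyad
        else:
            n += 1          # no arc: null dyad
    return f"{m}{a}{n}"

def label_030(arcs, triple):
    """Disambiguate a 030 triad: cyclic if every node has out-degree 1 within the triple."""
    outdeg = sorted(sum((u, v) in arcs for v in triple if v != u) for u in triple)
    return "030C" if outdeg == [1, 1, 1] else "030T"

triple = ("a", "b", "c")
transitive = {("a", "b"), ("a", "c"), ("b", "c")}   # feed-forward loop
cyclic = {("a", "b"), ("b", "c"), ("c", "a")}       # feedback loop

print(man_code(transitive, triple), label_030(transitive, triple))
print(man_code(cyclic, triple), label_030(cyclic, triple))
```

Both triads share the dyad census 030, which is why the appended letter is needed to tell them apart.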
Fig. 2 Motif examples. F, the transitive triangle (triad 030T) is not a special case of H, the out-star (triad 021D), when considered as motifs (or triad census classes): they are distinct induced subgraphs of three nodes. However, when considered as ERGM configurations, since H is a subgraph (but not an induced subgraph) of F (the transitive triangle is formed by “closing” the out-star with an additional arc), in their corresponding statistics both F and H are counted for an occurrence of F
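The difference between counting induced subgraphs (motifs) and counting configurations can be made concrete with a small Python sketch. It again assumes a set of `(source, target)` arcs without self-loops, and the helper names `out_two_stars` and `induced_021D` are hypothetical. For a single transitive triangle, the configuration count of out-stars is 1 while the induced 021D count is 0.

```python
from itertools import combinations
from math import comb

def out_two_stars(arcs, nodes):
    """Configuration-style count: every pair of arcs leaving the same node."""
    total = 0
    for u in nodes:
        outdeg = sum((u, v) in arcs for v in nodes if v != u)
        total += comb(outdeg, 2)
    return total

def induced_021D(arcs, nodes):
    """Motif-style count: triples whose induced subgraph is exactly an out-star."""
    count = 0
    for triple in combinations(nodes, 3):
        present = {(u, v) for u in triple for v in triple if (u, v) in arcs}
        for centre in triple:
            x, y = [v for v in triple if v != centre]
            if present == {(centre, x), (centre, y)}:
                count += 1
    return count

nodes = ["a", "b", "c"]
transitive = {("a", "b"), ("a", "c"), ("b", "c")}   # F in the caption
out_star = {("a", "b"), ("a", "c")}                 # H in the caption

print(out_two_stars(transitive, nodes), induced_021D(transitive, nodes))
print(out_two_stars(out_star, nodes), induced_021D(out_star, nodes))
```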
Table 1 Summary statistics for the biological networks
| Network | Directed | Nodes | Edges | Density | Clustering coefficient |
|---|---|---|---|---|---|
| Yeast PPI | No | 2617 | 11855 | 0.00346 | 0.46862 |
| Human PPI (HIPPIE) | No | 11517 | 47184 | 0.00071 | 0.03765 |
| Alon E. coli regulatory | Yes | 423 | 519 | 0.00291 | 0.02382 |
| Alon yeast regulatory | Yes | 688 | 1079 | 0.00228 | 0.01625 |
“Clustering coefficient” is the global clustering coefficient (transitivity)
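The density column can be checked from the node and edge counts alone. The following Python sketch assumes the usual definition of density (edges divided by the number of possible edges, excluding self-loops); the function name `density` is hypothetical. The clustering coefficient is not reproduced here because it requires the full edge list.

```python
def density(nodes, edges, directed):
    """Edges divided by the number of possible edges (self-loops excluded)."""
    possible = nodes * (nodes - 1)      # ordered pairs for a directed network
    if not directed:
        possible //= 2                  # unordered pairs for an undirected network
    return edges / possible

# (name, directed, nodes, edges) taken from the summary table
networks = [
    ("Yeast PPI", False, 2617, 11855),
    ("Human PPI (HIPPIE)", False, 11517, 47184),
    ("Alon E. coli regulatory", True, 423, 519),
    ("Alon yeast regulatory", True, 688, 1079),
]

for name, directed, n, m in networks:
    print(f"{name}: {density(n, m, directed):.5f}")
```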
Fig. 3 Degree distributions of the networks. Power law and log-normal distributions fitted to the CDF for degree distributions of the networks (in- and out-degree for directed networks, degree for undirected networks). All distributions apart from the E. coli in-degree distribution (for which a log-normal distribution could not be fitted), and the Human PPI (HIPPIE) degree distribution (which is not consistent with a power law distribution), are consistent with both power law and log-normal distributions
Table 2 Parameters for undirected networks
| Effect | Description |
|---|---|
| Edge | Baseline density |
| A2P | Alternating two-paths |
| AS | Alternating stars |
| AT | Alternating triangles |
| Match | Matching on categorical attribute |
Table 3 Parameters for directed networks
| Effect | Description |
|---|---|
| Arc | Baseline density |
| Sink | A positive parameter value indicates a tendency for nodes with incoming but no outgoing arcs |
| Source | A positive parameter value indicates a tendency for nodes with outgoing but no incoming arcs |
| Reciprocity | A positive parameter value indicates a tendency for arcs to be reciprocated (a cycle of length 2) |
| AltInStars | Alternating in-stars |
| AltOutStars | Alternating out-stars |
| AltTwoPathsT | Multiple 2-paths. A positive parameter value indicates a tendency for directed paths of length 2. Used as a “control” for AltKTrianglesT, the parameter for triangles formed by closing these 2-paths |
| AltKTrianglesT | Path closure or transitive closure. A positive parameter value indicates a tendency for open directed two-paths to be closed transitively. This is an alternating statistic version of the “feed-forward loop” motif |
| AltKTrianglesC | Cyclic closure. A positive parameter value indicates a tendency for directed cycles of length 3 in the network, representing non-hierarchical network closure. An alternating statistic version of the “three-node feedback loop” motif |
| Sender | Sender on binary attribute |
| Receiver | Receiver on binary attribute |
| Interaction | Interaction on binary attribute |
| Matching | Matching on categorical attribute |
| Loop | Self-edge. A positive parameter value indicates a tendency for self-edges (loops) |
Fig. 4 Alternating two-paths and alternating transitive triangles ERGM configurations for directed networks. Unlike motifs, ERGM configurations are not induced subgraphs, so it is normal (and often required) for one to be a subgraph of another. So AltTwoPathsT and AltKTrianglesT are frequently included in a model together, with AltKTrianglesT consisting of the AltTwoPathsT configuration “closed” by the addition of an arc
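The relationship between the two configurations can be sketched in Python using one common formulation of the alternating statistics, in which each pair of nodes contributes a geometrically down-weighted function of the number of directed two-paths between them. This is an illustration only: the exact definitions and the decay parameter used by the authors are not given in this section, so the formulas, the default `lam=2.0`, and the function names below are all assumptions.

```python
from collections import Counter

def two_path_counts(arcs):
    """Map (i, j) to the number of directed two-paths i -> k -> j (i, j, k distinct)."""
    successors = {}
    for u, v in arcs:
        successors.setdefault(u, set()).add(v)
    counts = Counter()
    for i, mids in successors.items():
        for k in mids:
            for j in successors.get(k, ()):
                if i != j and i != k and k != j:
                    counts[(i, j)] += 1
    return counts

def alt_two_paths_t(arcs, lam=2.0):
    """Assumed form: lam * sum over ordered pairs of 1 - (1 - 1/lam)^L2(i, j)."""
    counts = two_path_counts(arcs)
    return lam * sum(1 - (1 - 1 / lam) ** c for c in counts.values())

def alt_k_triangles_t(arcs, lam=2.0):
    """Same weighting, but only for pairs whose two-paths are closed by an arc i -> j."""
    counts = two_path_counts(arcs)
    return lam * sum(1 - (1 - 1 / lam) ** counts[(i, j)] for (i, j) in arcs)

open_path = {("a", "b"), ("b", "c")}
transitive = {("a", "b"), ("b", "c"), ("a", "c")}

print(alt_two_paths_t(open_path), alt_k_triangles_t(open_path))
print(alt_two_paths_t(transitive), alt_k_triangles_t(transitive))
```

Under these assumptions the open two-path contributes to the two-path statistic only, while adding the closing arc makes the same structure contribute to both statistics. This is why the two-path term acts as a control for the closure term.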
Table 4 Parameter estimates with 95% confidence interval for the yeast PPI network, from the EE algorithm
| Effect | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Edge | | | |
| AS | | | |
| A2P | – | | |
| AT | | | |
| Match class | – | – | |
Parameter estimates that are statistically significant are shown in bold
Table 5 Parameter estimates with 95% confidence interval for the human PPI (HIPPIE high confidence) network, from the EE algorithm
| Effect | Model 1 | Model 2 |
|---|---|---|
| Edge | | |
| AS | | |
| A2P | | |
| AT | | |
| Match cellular component | – | |
Parameter estimates that are statistically significant are shown in bold
Table 6 Parameter estimates with 95% confidence interval for the Alon E. coli regulatory network
| Effect | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| Arc | | | | |
| Sink | | | | |
| Source | | | | |
| AltInStars | | | | |
| AltOutStars | | | | |
| AltTwoPathsT | | | | |
| AltKTrianglesT | | | | |
| Matching self | – | – | – | |
| Loop | – | – | – | |
Parameter estimates that are statistically significant are shown in bold. In Models 3 and 4, self-edges (loops) are retained in the network and allowed in the model
Fig. 5 Alon E. coli regulatory network. (a) Node size is proportional to in-degree. (b) Node size is proportional to out-degree. Self-regulating operons are depicted as filled (red) circles. In (a) there appears to be a small set of high in-degree nodes and a much larger set of smaller in-degree nodes, while in (b) the out-degree of the nodes appears to be much more evenly distributed. The hypothesis we might make from (a), that there is centralization on in-degree, is confirmed by the ERGM results. This same model finds no support for the hypothesis we might make from (b), that there is a tendency against centralization on out-degree
Fig. 6 Goodness-of-fit plots for the triad census of (a) the Alon E. coli regulatory network, Model 1 (Table 6), and (b) the Alon yeast regulatory network, Model 1 (Table 7). The observed triad counts are plotted in red with the triad counts of 100 simulated networks plotted as black boxplots. Because the triad counts (y-axis) are on a log scale, values of zero are omitted (observed zero counts shown as a red point on the bottom of the graph). In (a), for triad census class 030C (cyclic triad), the “box plot” consisting of a single median line for the simulated count represents a single (out of 100 simulations) occurrence of a nonzero count (of 1) for 030C
Table 7 Parameter estimates with 95% confidence intervals for the Alon yeast regulatory network
| Effect | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Arc | | | |
| Reciprocity | – | – | |
| AltInStars | | | |
| AltOutStars | | | |
| AltTwoPathsT | | | |
| AltKTrianglesT | | | |
Parameter estimates that are statistically significant are shown in bold. In Model 1 only, estimation is conditional on no reciprocated arcs, even though there is a single reciprocated arc in the data. Model 3 is included for illustration, even though it shows poor convergence with respect to the Reciprocity parameter (t-ratio magnitude is greater than 0.3)
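The convergence criterion mentioned in this note can be sketched as follows. The sketch assumes one common definition of the t-ratio for a statistic: the difference between the mean simulated value and the observed value, divided by the standard deviation of the simulated values. The definition actually used by the estimation software may differ. The 0.3 threshold is the one stated above; the function names and the sample numbers are hypothetical, for illustration only.

```python
from statistics import mean, stdev

def t_ratio(observed, simulated):
    """Assumed definition: (mean of simulated statistic - observed) / sd of simulated."""
    return (mean(simulated) - observed) / stdev(simulated)

def converged(observed, simulated, threshold=0.3):
    """A parameter is treated as converged when |t-ratio| does not exceed the threshold."""
    return abs(t_ratio(observed, simulated)) <= threshold

# Made-up counts of one statistic in networks simulated from a fitted model
simulated_counts = [9, 10, 11, 10, 10]

print(converged(10, simulated_counts))   # observed value matches the simulated mean
print(converged(11, simulated_counts))   # observed value is far from the simulated mean
```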
Fig. 7 (a) The bi-fan and bi-parallel four-node motifs. Goodness-of-fit plots for these motifs for (b) the Alon E. coli regulatory network Model 1 (Table 6), and (c) the Alon yeast regulatory network Model 1 (Table 7). The observed network statistics are plotted as a red diamond, with the statistics of 100 simulated networks plotted as black boxplots