Paola Lecca, Angela Re.
Abstract
Detection of the modular structure of biological networks is of interest to researchers adopting a systems perspective for the analysis of omics data. Computational systems biology has provided a rich array of methods for network clustering. To date, the majority of approaches address this task through a network node classification based on topological or external quantifiable properties of network nodes. Conversely, numerical properties of network edges are underused, even though the information content that can be associated with network edges has increased owing to steady advances in molecular biology technology over the last decade. Properly accounting for network edges in the development of clustering approaches can be crucial to improving the quantitative interpretation of omics data, ultimately resulting in more biologically plausible models. In this study, we present a novel technique for network module detection, named WG-Cluster (Weighted Graph CLUSTERing). WG-Cluster's notable features, compared to current approaches, lie in: (1) the simultaneous exploitation of network node and edge weights to improve the biological interpretability of the connected components detected, (2) the assessment of their statistical significance, and (3) the identification of emerging topological properties in the detected connected components. WG-Cluster comprises three major steps: (i) an unsupervised version of an edge-based k-means algorithm detects sub-graphs with similar edge weights, (ii) a fast-greedy algorithm detects connected components, which are then scored and selected according to the statistical significance of their scores, and (iii) an analysis of the convolution between sub-graph mean edge weight and connected component score provides a summarizing view of the connected components. WG-Cluster can be applied to directed and undirected networks of different types of interacting entities and scales up to large omics data sets.
Here, we show that WG-Cluster can be successfully used in the differential analysis of physical protein-protein interaction (PPI) networks. Specifically, applying WG-Cluster to a PPI network weighted by measurements of differential gene expression makes it possible to explore the changes in network topology under two distinct (normal vs. tumor) conditions. WG-Cluster code is available at https://sites.google.com/site/paolaleccapersonalpage/.
Keywords: clustering; connected component; edge weight; entropy; node weight; protein–protein network; weighted network
Year: 2015 PMID: 26379697 PMCID: PMC4551098 DOI: 10.3389/fgene.2015.00265
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Algorithmic modules of WG-Cluster. WG-Cluster takes as input the SIF file of the network edges and a text file reporting node labels in the first column and node weights in the second one. If the second file is not available, WG-Cluster by default assigns an equal weight to all nodes. WG-Cluster implements three computational modules: (i) an unsupervised version of the K-means algorithm identifies sub-graphs with similar edge weights, (ii) a fast-greedy algorithm detects the connected components of each sub-graph utilizing similarity in node topological properties, (iii) the estimates of the convolution of connected component entropy and sub-graph mean edge weight guide the selection of significant connected components representative of global trends in the network. The complexity of the modules for estimating the optimal number of sub-graphs and for running Lloyd's K-means is linear in the number of edges NE and in the number of iterations; the complexity of the module for detecting connected components is O(V (log V)²), where V is the number of vertices.
Compute the optimal number of sub-graphs K
| 7: wcss[1] ← (NE - 1) × Variance(edge.weights) |
| 9: set.seed(seed) |
| 11: wcss[i] ← |
| 17: n.sub.graphs ← 1:max.n.sub.graphs |
| 18: wcss.derivative ← Stineman.derivative(n.sub.graphs, wcss) |
| 22: tolerance ← ϵ |
| 26: wcss.derivative.null ← {−ϵ ≤ wcss.derivative ≤ ϵ} |
| 27: K ← wcss.derivative.null[1] |
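The surviving steps of this procedure (the single-cluster WCSS, the derivative of the WCSS curve, and the first number of sub-graphs at which that derivative falls within ±ϵ) can be illustrated with a minimal Python sketch. It is not the WG-Cluster code: the function names are hypothetical, and a plain finite-difference slope is swapped in for the Stineman derivative named in the pseudocode.

```python
from statistics import variance


def single_cluster_wcss(edge_weights):
    """WCSS for one cluster: (NE - 1) * sample variance of the edge weights."""
    return (len(edge_weights) - 1) * variance(edge_weights)


def finite_difference(xs, ys):
    """Slope estimate: central differences inside, one-sided at both ends.

    Assumption: this replaces the Stineman derivative used in the pseudocode.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points to estimate a slope")
    slopes = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        slopes.append((ys[hi] - ys[lo]) / (xs[hi] - xs[lo]))
    return slopes


def choose_k(wcss, tolerance):
    """Smallest k whose WCSS slope lies in [-tolerance, tolerance], else None.

    wcss[0] is the WCSS for k = 1, wcss[1] for k = 2, and so on.
    """
    ks = list(range(1, len(wcss) + 1))
    slopes = finite_difference(ks, wcss)
    flat = [k for k, s in zip(ks, slopes) if -tolerance <= s <= tolerance]
    return flat[0] if flat else None


if __name__ == "__main__":
    # Illustrative WCSS curve (made-up numbers) that flattens after k = 3.
    print(choose_k([100.0, 40.0, 10.0, 9.5, 9.4, 9.3], tolerance=0.5))
```

The tolerance is on the scale of the WCSS values, so in practice it would have to be chosen relative to the magnitude of the edge weights.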
Lloyd's K-means algorithm
| 7: a. Assign edge weights to the centroids |
| 9: Assign edge.weights[i] to closest sub-graph according to the distance measure. |
| 11: b. Recalculate centroids. |
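The two alternating steps listed above (assign each edge weight to the closest centroid, then recalculate the centroids) can be sketched for one-dimensional edge weights as follows. This is a minimal illustration, not the authors' implementation: the function name, the random initialisation from the data, the absolute-difference distance, and the stopping rule are all assumptions.

```python
import random


def lloyd_kmeans_1d(edge_weights, k, seed=0, max_iter=100):
    """Cluster scalar edge weights into k groups with Lloyd's iteration.

    Returns (labels, centroids, wcss). Assumes k <= len(edge_weights).
    """
    rng = random.Random(seed)
    # Assumption: initial centroids are k edge weights drawn at random.
    centroids = rng.sample(edge_weights, k)
    labels = None
    for _ in range(max_iter):
        # a. Assign every edge weight to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: abs(w - centroids[c]))
            for w in edge_weights
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # b. Recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [w for w, lab in zip(edge_weights, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    wcss = sum((w - centroids[lab]) ** 2 for w, lab in zip(edge_weights, labels))
    return labels, centroids, wcss


if __name__ == "__main__":
    weights = [0.10, 0.12, 0.11, 0.90, 0.95, 0.92]
    labels, centroids, wcss = lloyd_kmeans_1d(weights, k=2)
    print(labels, centroids, wcss)
```

Each iteration touches every edge weight once per centroid, which is consistent with the stated cost being linear in the number of edges and the number of iterations for a fixed number of sub-graphs.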
Figure 2. Sub-graph decomposition into connected components. The algorithm first clusters the input graph into sub-graphs consisting of similar edge weights and next detects the connected components present within each sub-graph.
Detection and selection of connected components
| 8: connected.components[[i]] ← fast.greedy.decomposition(sub-graph[i]) |
| 15: connected.components.entropy[l] ← entropy(connected.components[[i]][l], node.weights) |
| 21: random.cc.component ← erdos.renyi.graph(nr.of.nodes = |
| 23: edge.weights.random.cc.component ← Unif(0, 1) |
| 24: node.weights.random.cc.component ← Unif(0, 1) |
| 25: random.cc.entropies[v] ← calculate.entropy(random.cc.component, node.weights.random.cc.component, edge.weights.random.cc.component) |
| 30: random.cc.entropy[l] ← calculate.mean.entropy(random.cc.entropies) |
| 34: selected.connected.components ← |
| 36: discard |
| 44: density.of.convolution ← density(convolve(E, MW)) |
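The scoring and selection idea in this procedure (compare each connected component's entropy with the entropies of random Erdős-Rényi graphs carrying Unif(0, 1) node and edge weights) can be sketched as below. Several parts are assumptions made only for illustration: the entropy definition (Shannon entropy of node weight multiplied by incident edge-weight sum), the use of random graphs with the same number of nodes and edges as the component, and the rule of retaining a component when its entropy lies more than a chosen number of standard deviations from the random mean. None of the function names come from WG-Cluster.

```python
import math
import random
from statistics import mean, stdev


def component_entropy(nodes, edges, node_w, edge_w):
    """Hypothetical score: Shannon entropy of p_i ~ |node weight| * strength.

    strength = sum of |edge weights| incident to the node (an assumption;
    the source does not define the entropy).
    """
    strength = {v: 0.0 for v in nodes}
    for (u, v), w in zip(edges, edge_w):
        strength[u] += abs(w)
        strength[v] += abs(w)
    mass = [abs(node_w[v]) * strength[v] for v in nodes]
    total = sum(mass)
    if total == 0:
        return 0.0
    return -sum((m / total) * math.log(m / total) for m in mass if m > 0)


def random_component_entropies(n_nodes, n_edges, n_random, rng):
    """Entropies of n_random Erdos-Renyi G(n, m) graphs with Unif(0, 1) weights.

    Requires n_edges <= n_nodes * (n_nodes - 1) / 2.
    """
    nodes = list(range(n_nodes))
    pairs = [(u, v) for u in nodes for v in nodes if u < v]
    entropies = []
    for _ in range(n_random):
        edges = rng.sample(pairs, n_edges)
        edge_w = [rng.random() for _ in edges]
        node_w = {v: rng.random() for v in nodes}
        entropies.append(component_entropy(nodes, edges, node_w, edge_w))
    return entropies


def is_significant(observed, null_entropies, n_sd=2.0):
    """Retain a component whose entropy is > n_sd SDs from the random mean."""
    mu, sd = mean(null_entropies), stdev(null_entropies)
    return abs(observed - mu) > n_sd * sd


if __name__ == "__main__":
    rng = random.Random(1)
    nodes = [0, 1, 2, 3]
    edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
    node_w = {0: 0.9, 1: 0.1, 2: 0.1, 3: 0.1}  # made-up weights
    edge_w = [0.8, 0.2, 0.2, 0.7]
    h = component_entropy(nodes, edges, node_w, edge_w)
    null = random_component_entropies(len(nodes), len(edges), 200, rng)
    print(h, mean(null), is_significant(h, null))
```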
Summary of widely used hierarchical methods for module detection.
| Method | Graph type | Edge weights | Node weights |
| Edge-Betweenness (Girvan and Newman) | Directed and undirected | True | False |
| Fast-greedy (Clauset et al.) | Directed and undirected | True | False |
| InfoMap (Rosvall and Bergstrom) | Directed and undirected | True | True |
In the last two columns, “True” means that the method can also process that type of information and “False” means that it cannot. For instance, the edge-betweenness clustering method takes edge weights into account, but it does not handle information about node weights.
Figure 3. Running times to cluster random weighted graphs with increasing number of edges. WG-Cluster running time on a random weighted graph of 500 nodes and an increasing number of edges is compared with that of the edge-betweenness graph clustering algorithm (Girvan and Newman, 2001) and that of InfoMap (Rosvall and Bergstrom, 2008). Each algorithm was run in its R implementation on a desktop Windows 8.1 PC with a 3.1 GHz CPU. WG-Cluster ran faster than both and used less than 3 GB of RAM.
Figure 4. Network properties of WG-Cluster reconstructed modules. (A) Bar plot displaying the fraction of connected components that are discarded or retained according to the number of standard deviations separating their entropy from the mean of the entropy distribution derived from randomized connected components. (B) Density plot of the convolution between the connected component entropy and the mean edge weight of the respective sub-graph. Maximum points in the density plot are highlighted by arrows. The number at each arrow denotes the number of selected connected components, i.e., connected components whose entropy and mean edge weight correspond to convolution intervals at the maxima of the density plot. (C) Dot plot displaying the percentage of interactions yielding differential co-expression scores higher than IntAct scores as a function of sub-graph mean edge weight. Scores are taken in absolute value. (D) Bar plot showing the fraction of connected components retained in each sub-graph. (E) Bar plot showing the mean entropy of connected components selected solely on the basis of entropy significance or on the basis of convolution analysis in each sub-graph.
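The analysis behind panel (B), a density estimate of the convolution between connected component entropies and sub-graph mean edge weights with its maxima marked, can be sketched roughly as follows. This is an illustration under stated assumptions rather than the published procedure: it uses a full linear convolution, a Gaussian kernel with a hand-picked bandwidth, and a simple neighbour comparison to find maxima, and all function names are hypothetical.

```python
import math


def linear_convolution(x, y):
    """Full discrete convolution: out[n] = sum over i + j = n of x[i] * y[j]."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


def gaussian_kde(samples, grid, bandwidth):
    """Gaussian kernel density estimate of samples, evaluated on grid."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples) / norm
        for g in grid
    ]


def local_maxima(grid, density):
    """Grid points whose density is strictly higher than both neighbours."""
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    ]


if __name__ == "__main__":
    entropies = [0.2, 0.4, 1.1, 1.3]   # made-up component entropies (E)
    mean_weights = [-0.5, 0.1, 0.6]    # made-up sub-graph mean edge weights (MW)
    conv = linear_convolution(entropies, mean_weights)
    grid = [-1.0 + 0.05 * i for i in range(61)]
    dens = gaussian_kde(conv, grid, bandwidth=0.15)
    print(local_maxima(grid, dens))
```

The bandwidth controls how many maxima appear, so any count of selected components read from such a plot depends on that choice.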
Figure 5. Module enrichment in Gene Ontology categories. Heat map showing, for each sub-graph, the number of connected components that were significantly enriched in GO Biological Process categories (adjusted P-value < 0.05). Vertical bar colors denote the sign of sub-graph mean edge weights.