Mikaela Koutrouli, Evangelos Karatzas, David Paez-Espino, Georgios A Pavlopoulos.
Abstract
Networks are one of the most common ways to represent biological systems as complex sets of binary interactions or relations between different bioentities. In this article, we discuss the basic graph theory concepts and the various graph types, as well as the available data structures for storing and reading graphs. In addition, we describe several network properties and we highlight some of the widely used network topological features. We briefly mention the network patterns, motifs and models, and we further comment on the types of biological and biomedical networks along with their corresponding computer- and human-readable file formats. Finally, we discuss a variety of algorithms and metrics for network analyses regarding graph drawing, clustering, visualization, link prediction, perturbation, and network alignment as well as the current state-of-the-art tools. We expect this review to reach a very broad spectrum of readers varying from experts to beginners while encouraging them to enhance the field further.
Keywords: biological networks; clustering; graph theory; topology; visualization
Year: 2020 PMID: 32083072 PMCID: PMC7004966 DOI: 10.3389/fbioe.2020.00034
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Network representations and types. (A) Two graphical representations of a graph G = (V, E) with vertex set V = {1, 2, 3, 4, 5} and edge set E = {{1, 3}, {2, 3}, {3, 4}, {3, 5}, {4, 5}}. (B) Representation of subgraph G′ = (V′, E′) with vertex set V′ = {3, 4, 5} and edge set E′ = {{3, 4}, {3, 5}, {4, 5}}. (C) Graph G″ = (V″, E″), with vertex set V″ = {a, b, c, d, e} and edge set E″ = {{a, c}, {b, c}, {c, d}, {c, e}, {d, e}}, is isomorphic to graph G = (V, E). (D) Undirected graph G = (V, E) with vertex set V = {1, 2, 3, 4, 5} and edge set E = {{1, 3}, {2, 3}, {3, 4}, {3, 5}, {4, 5}}. (E) Directed graph G = (V, E) with vertex set V = {1, 2, 3, 4, 5} and edge set E = {{3, 1}, {3, 2}, {3, 4}, {4, 5}, {3, 5}, {5, 3}}. (F) Semantic graph. (G) Weighted graph G = (V, E) with vertex set V = {1, 2, 3, 4, 5} and edge set E = {{3, 1, 0.4}, {3, 2, 0.1}, {3, 4, 1.0}, {4, 5, 0.1}, {3, 5, 0.4}}. (H) Mixed graph G = (V, E) with vertex set V = {1, 2, 3, 4, 5} and edge set E = {{1, 3}, {3, 2}, {3, 4}, {5, 3}, {4, 5}}. (I) Bipartite graph with vertex sets V′ = {1, 2, 3}, V″ = {4, 5, 6, 7} and edge set E = {{1, 4}, {1, 7}, {2, 4}, {2, 5}, {3, 6}, {3, 7}}. (J) Multi-edge graph G = (V, E) with vertex set V = {1, 2, 3} and three different types of edge sets E′ = {{1, 2}, {2, 3}, {3, 1}}, E″ = {{1, 2}, {1, 3}}, E‴ = {{1, 3}}. (K) Hypergraph G = (V, E) with vertex set V = {1, 2, 3, 4, 5} and a single edge connecting multiple nodes, E = {{1, 2, 3, 4, 5}}. (L) A tree graph G = (V, E) with vertex set V = {1, 2, 3, 4, 5, 6, 7} and edge set E = {{1, 2}, {2, 4}, {2, 5}, {1, 3}, {3, 6}, {3, 7}}. (M) A graph G = (V, E) with vertex set V = {1, 2, 3, 4, 5, 6, 7, 8, 9} and edge set E = {{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 4}, {2, 5}, {2, 6}, {2, 7}, {3, 4}, {3, 5}, {3, 9}, {4, 5}, {4, 8}}. It contains a cluster consisting of nodes V = {1, 2, 3, 4, 5} and edges E = {{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {3, 5}, {4, 5}}. (N) A five-node clique (on the right): every node is connected to every other node.
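The set notation used in panels (A), (B) and (N) maps directly onto code. The following is a minimal Python sketch, not taken from the article, that stores the graph of panel (A) as vertex and edge sets, extracts the subgraph on {3, 4, 5}, and tests whether a vertex set forms a clique. The helper names `induced_subgraph` and `is_clique` are illustrative assumptions.

```python
from itertools import combinations

# Graph G from panel (A); each undirected edge is stored as a frozenset.
V = {1, 2, 3, 4, 5}
E = {frozenset(e) for e in [(1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]}


def induced_subgraph(vertices, edges):
    """Keep only the edges whose two endpoints both lie in `vertices`."""
    return {e for e in edges if e <= vertices}


def is_clique(vertices, edges):
    """True if every pair of distinct vertices is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(vertices, 2))


sub_V = {3, 4, 5}
sub_E = induced_subgraph(sub_V, E)

print(sorted(sorted(e) for e in sub_E))  # [[3, 4], [3, 5], [4, 5]]
print(is_clique(sub_V, E))               # True: nodes 3, 4, 5 form a triangle
print(is_clique(V, E))                   # False: e.g. nodes 1 and 2 are not linked
```

Using `frozenset` for edges makes {1, 3} and {3, 1} the same edge, which matches the undirected case; a directed graph would store ordered tuples instead.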
Figure 2. Adjacency matrices and alternative data structures. (A) Simple undirected graph consisting of five nodes (N = V = 5) and four edges (E = 4). (B) A directed graph represented by a non-symmetric adjacency matrix. (C) A simple weighted graph. (D) The bipartite graph and its adjacency matrix. (E) The graph's projections. In the projected network colored green, node V1, for example, is connected to node V2 through node V4. (F) The upper triangular part of the adjacency matrix. (G) The upper triangular part of the adjacency matrix in a linear form. Element A[2,3] = 0.9 in the adjacency matrix is element B[10] = 0.9 in the linear form. (H) The graph presented as an adjacency list. Each vertex is accompanied by a list containing all other vertices adjacent to it. (I) A data structure for efficiently storing sparse matrices with many zeros. The first two columns indicate the coordinates in an adjacency matrix, whereas the third column contains the connection weight.
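The storage schemes listed in this caption can be compared side by side in a few lines. The sketch below uses a hypothetical four-node weighted graph (not the one drawn in the figure) and 0-based indices, so its linear-form positions follow its own convention and are not meant to reproduce the B[10] index quoted above.

```python
# Hypothetical weighted, undirected graph with 0-based node indices.
n = 4
edges = [(0, 1, 0.5), (0, 2, 1.0), (1, 2, 0.9), (2, 3, 0.2)]

# Adjacency matrix: symmetric because the graph is undirected.
A = [[0.0] * n for _ in range(n)]
for i, j, w in edges:
    A[i][j] = A[j][i] = w

# Upper triangle (diagonal excluded) flattened row by row.
upper = [A[i][j] for i in range(n) for j in range(i + 1, n)]


def upper_index(i, j, n):
    """Position of A[i][j] (with i < j) inside the flattened upper triangle."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)


# Adjacency list: each vertex maps to the vertices adjacent to it.
adj = {v: [] for v in range(n)}
for i, j, _ in edges:
    adj[i].append(j)
    adj[j].append(i)

# Coordinate list for sparse storage: (row, column, weight) per non-zero entry.
coo = [(i, j, w) for i, j, w in edges]

print(upper[upper_index(1, 2, n)])  # 0.9
print(adj[2])                       # [0, 1, 3]
print(len(coo), "stored entries instead of", n * n)
```

The flattened triangle needs n(n − 1)/2 cells regardless of density, whereas the adjacency list and the coordinate list grow only with the number of edges, which is why they suit sparse networks.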
Figure 3. Network properties and topological features. (A) A network G = (V, E) consisting of V = 18 nodes and E = 21 edges. Each node's size has been adjusted according to its degree. Vertex V1, for example, has 10 neighbors, thus degree d(V1) = 10. The average degree for the whole network is $2E/V = (2 \times 21)/18 \approx 2.33$. The network has been visualized with Cytoscape. (B) A scatterplot showing the degree distribution. The Y axis shows how many nodes have a certain degree (values on the X axis). (C) Clustering coefficient. Node V has 6 neighbors {V1, V2, V3, V4, V5, V6}. The maximum number of edges between these neighbors is 15, but only two neighbors (V1 and V2) are connected to each other, thus making the clustering coefficient for node V equal to $1/15 \approx 0.067$. (D) Similarly, when the neighbors of node V are connected with 11 edges between each other (E = {{V1,V2}, {V1,V5}, {V1,V4}, {V2,V3}, {V2,V6}, {V2,V4}, {V3,V6}, {V3,V5}, {V4,V6}, {V4,V5}, {V5,V6}}), the clustering coefficient for this node is $11/15 \approx 0.733$. Notably, dotted lines represent the direct connections of node V, whereas the solid lines represent the connections between the first neighbors of node V. (E) The closeness centrality in blue, the betweenness centrality in red and the eccentricity centrality in orange. The graph consists of 6 nodes and 5 edges. Closeness centrality calculation example: Node V1 accesses nodes V2, V4, V5, V6 with step 1 and node V3 with step 2. Therefore, its closeness centrality is calculated from the sum of these distances, 4 × 1 + 1 × 2 = 6. Betweenness centrality calculation example: Since all nodes are accessible from any other node, there are N(N − 1) = 6 × 5 = 30 shortest paths but only 12 of them pass through node V2. These are {V3, V2}, {V3, V2, V1}, {V3, V2, V1, V4}, {V3, V2, V1, V5}, {V3, V2, V1, V6}, {V2, V1}, {V2, V1, V4}, {V2, V1, V6}, {V2, V1, V5}, {…, V2, V3}, {…, V2, V3} and {…, V2, V3}. Therefore, the betweenness centrality of node V2 is $12/30 = 0.4$. Eccentricity calculation example: Node V1 accesses nodes V2, V4, V5, V6 with one step and node V3 with two steps. Therefore, its eccentricity will be max(2, 1) = 2.
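The worked numbers in this caption can be checked with a short Python sketch. The edge list for panel (E) is an assumption inferred from the caption (V1 adjacent to V2, V4, V5 and V6; V2 adjacent to V3), and the function names are illustrative.

```python
from collections import deque

# Panel (A): average degree of an undirected network with 18 nodes, 21 edges.
avg_degree = 2 * 21 / 18


def clustering_coefficient(num_neighbors, links_between_neighbors):
    """Fraction of possible neighbor-neighbor edges that actually exist."""
    possible = num_neighbors * (num_neighbors - 1) / 2
    return links_between_neighbors / possible


# Panel (D): the 11 edges listed in the caption among the 6 neighbors of V.
links = [("V1", "V2"), ("V1", "V5"), ("V1", "V4"), ("V2", "V3"),
         ("V2", "V6"), ("V2", "V4"), ("V3", "V6"), ("V3", "V5"),
         ("V4", "V6"), ("V4", "V5"), ("V5", "V6")]
cc = clustering_coefficient(6, len(links))

# Panel (E): assumed topology, reconstructed from the step counts in the text.
adj = {
    "V1": ["V2", "V4", "V5", "V6"],
    "V2": ["V1", "V3"],
    "V3": ["V2"],
    "V4": ["V1"],
    "V5": ["V1"],
    "V6": ["V1"],
}


def bfs_distances(adj, source):
    """Shortest-path length (in steps) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


dist = bfs_distances(adj, "V1")
total_distance = sum(dist.values())  # basis of the closeness centrality
eccentricity = max(dist.values())

print(round(avg_degree, 2))  # 2.33
print(round(cc, 3))          # 0.733
print(total_distance)        # 6
print(eccentricity)          # 2
```

Closeness centrality is commonly reported either as the reciprocal of the total distance or as (N − 1) divided by it; the sketch stops at the total so that either normalization can be applied.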
Figure 4. Motifs. (A) Motif examples of three and four nodes. (B) The 13 possible directed motifs using three nodes.
Figure 5. Network models. (A) An Erdős–Rényi random network. (B) A Watts–Strogatz network. (C) A Barabási–Albert (BA) scale-free network. Graphs were visualized using R. Code example: g1 = sample_smallworld(1, size = 500, nei = 4, p = 0.03); plot(g1, layout = layout.fruchterman.reingold, vertex.label = NA, edge.arrow.size = 0.02, vertex.size = 0.5, xlab = "Random Network: G(N,p) model").
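The G(N,p) model named in the plot label can be written without any graph library. The following Python sketch is an illustration rather than the code behind the figure: it links each of the N(N − 1)/2 node pairs independently with probability p. The function name `erdos_renyi`, the seed, and the parameter values are assumptions chosen for the example.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Return the edge list of a G(n, p) random graph on nodes 0..n-1."""
    rng = random.Random(seed)
    # Every unordered pair becomes an edge independently with probability p.
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


n, p = 500, 0.01
edges = erdos_renyi(n, p, seed=42)
mean_degree = 2 * len(edges) / n

print(len(edges), "edges")
print("mean degree:", round(mean_degree, 2), "expected:", round(p * (n - 1), 2))
```

The observed mean degree should lie close to p(N − 1); the pronounced hubs of a scale-free network are not expected under this model.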
Figure 6. Examples of biological networks. (A) A protein-protein interaction (PPI) network shown in Cytoscape. (B) A sequence similarity network visualized with Cytoscape. Each edge corresponds to an alignment score. (C) A KEGG metabolic pathway. (D) A Reactome signal transduction network. (E) The tree of life visualized by iTOL. (F) A gene expression network with up- (red) and down-regulated genes (green). (G) A Savanna food web (credit: Siyavula Education). (H) A tagged PubMed abstract showing abstract-based co-occurrences. (I) A STRING multi-edge PPI knowledge network.
Figure 7. Examples of file formats. (A) Simple undirected graph consisting of seven nodes (V = 7) and six edges (E = 6). (B) Network in Tab-delimited file format. (C) Network in GraphML file format. Blue box highlights the interaction between nodes V1 and V7. (D) A cytoscape.js graph encoded in JSON. (E) Network in PSI-MI file format.
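Of the formats listed, the tab-delimited one is the simplest to read programmatically. The Python sketch below parses a two-column edge list; apart from the V1 to V7 interaction mentioned in the caption, the edges are hypothetical placeholders chosen to give seven nodes and six edges, and `read_edge_list` is an illustrative helper name.

```python
import csv
import io

# Hypothetical tab-delimited network: one interaction per line.
tsv_text = "V1\tV2\nV1\tV3\nV1\tV4\nV1\tV5\nV1\tV6\nV1\tV7\n"


def read_edge_list(handle):
    """Parse 'source<TAB>target' lines into a list of (source, target) pairs."""
    reader = csv.reader(handle, delimiter="\t")
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


edges = read_edge_list(io.StringIO(tsv_text))
nodes = {node for edge in edges for node in edge}

print(len(nodes), "nodes,", len(edges), "edges")  # 7 nodes, 6 edges
print(("V1", "V7") in edges)                      # True
```

A third column, when present, usually carries an edge weight or interaction type; the reader above ignores it.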
Figure 8. Network layouts. (A) Grid layout. (B) Circular layout. (C) Hierarchical layout. (D) Force-directed layout. (E) Edge-bundling. All views have been generated with Cytoscape.
Figure 9. Network representations. (A) A network visualized by Cytoscape with the use of a force-directed layout algorithm. (B) A multi-layered graph visualized by Arena3D. (C) A hive-plot view. (D) A network in 3D visualized by the Graphia application. (E) A multi-edge network visualized by STRING. (F) A network visualized with the use of arcs. (G) A network visualized as a colored adjacency matrix. (H) A circular Circos view. (I) Visualization of a bipartite graph.
Figure 10. Network clustering. (A) A yeast PPI network. (B) The PPI network clustered with MCL. (C) The PPI network clustered with MCL with the initial connections restored. (D) The initial network structure with some MCL clusters highlighted. (E) A cluster in high resolution. (F) Gene Ontology enrichment for the zoomed cluster. Visualization was done with Cytoscape, whereas clustering was performed with the ClusterMaker2 plugin.
Figure 11. Example of hierarchical clustering. (A) The expression values of five genes in three conditions. (B) The chart showing the genes' expression values as patterns. (C) The Pearson correlation coefficient (PCC) matrix showing all pairwise PCC values. (D) The Pearson correlation matrix in the form of a fully connected graph. (E) The distance matrix derived from the PCC matrix (D = 1 − PCC). (F) A 2D average linkage hierarchical clustering. Genes G1, G2 as well as genes G3, G4, G5 are clustered together.
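The pipeline in panels (A) to (F) can be sketched in plain Python: compute pairwise PCC values, convert them to distances with D = 1 − PCC, and merge clusters by average linkage. The expression values below are hypothetical (the figure's actual numbers are not given in the caption) and were chosen so that G1 and G2 rise across conditions while G3, G4 and G5 fall.

```python
from math import sqrt

# Hypothetical expression values of five genes in three conditions.
expr = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [2.0, 4.0, 5.0],
    "G3": [5.0, 3.0, 1.0],
    "G4": [6.0, 4.0, 1.0],
    "G5": [4.0, 3.0, 1.0],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


genes = list(expr)
# Distance matrix D = 1 - PCC, stored as a nested dictionary.
dist = {a: {b: 1 - pearson(expr[a], expr[b]) for b in genes} for a in genes}


def average_linkage(names, dist, k):
    """Merge the two closest clusters (mean pairwise distance) until k remain."""
    clusters = [[name] for name in names]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist[a][b] for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


clusters = average_linkage(genes, dist, k=2)
print(sorted(sorted(c) for c in clusters))
# [['G1', 'G2'], ['G3', 'G4', 'G5']]
```

With D = 1 − PCC, perfectly correlated genes sit at distance 0 and perfectly anti-correlated genes at distance 2, so the two expression trends separate cleanly.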
Figure 12. Clustering comparisons. (A) Rand Index between C1 and C2. C11 and C12 are clusters 1 and 2 of the C1 clustering, respectively. One pair [1, 2] is clustered together in both clusterings, three pairs [1, 5], [2, 5], and [3, 4] are separated in both clusterings, and the remaining six pairs [1, 3], [1, 4], [2, 3], [2, 4], [3, 5], and [4, 5] have been placed together in only one of the two clusterings. The Rand Index between the two clusterings is calculated as $R(C_1, C_2) = (1 + 3)/10 = 0.4$. (B) Maximum-Match-Measure between C1 and C2. C1 has four clusters while C2 has three. At the first iteration, the maximum element of the cluster-intersections' confusion matrix is chosen, and column 1 and row 1 are crossed out. At the second iteration, the maximum element of the remaining confusion matrix is chosen, and column 2 and row 2 are crossed out. At the third and final iteration, the maximum remaining element is chosen. The metric is calculated as the sum of the chosen elements divided by the total number of clustered elements. On the same schema, if C1 is chosen as the optimal clustering, the F-measure for C2 can be calculated. First, the precision and recall measures are calculated for clusters C11 and C21 as $P = |C_{11} \cap C_{21}| / |C_{21}|$ and $R = |C_{11} \cap C_{21}| / |C_{11}|$. Then, the F-Measure can be calculated between these two clusters as $F(C_{11}, C_{21}) = 2PR/(P + R)$. By calculating the respective values for the rest of the cluster pairs, the matrix (C) is created. The overall F-Measure of C2 against C1 is then obtained from this matrix. (D) Variation of Information matrix of the P(i, j) probabilities of an element being in the intersection of clusters. Based on the two clustering schemas of (A), the entropy of C1 is $H(C_1) = -\sum_i P(i) \log_2 P(i) = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) \simeq 0.97$ and, following the same procedure, H(C2) ≃ 0.97. The mutual information between the two clusterings is calculated as $I(C_1, C_2) = \sum_{i,j} P(i, j) \log_2 \frac{P(i, j)}{P(i) P(j)} \simeq 0.02$. The final value of the Variation of Information metric becomes VI(C1, C2) = H(C1) + H(C2) − 2I(C1, C2) = 0.97 + 0.97 − 2 × 0.02 = 1.9.
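The Rand Index and Variation of Information values of panels (A) and (D) can be reproduced with a small Python sketch. The two clusterings below are an assumption: the caption does not list them, but C1 = {{1, 2, 3}, {4, 5}} and C2 = {{1, 2, 4}, {3, 5}} are consistent with its statements that only the pair [1, 2] is together in both clusterings and that H(C1) ≃ H(C2) ≃ 0.97.

```python
from itertools import combinations
from math import log2

# Assumed clusterings, chosen to be consistent with the caption's values.
C1 = [{1, 2, 3}, {4, 5}]
C2 = [{1, 2, 4}, {3, 5}]
n = 5


def labels(clustering):
    """Map each element to the index of the cluster that contains it."""
    return {x: idx for idx, cluster in enumerate(clustering) for x in cluster}


def rand_index(a, b):
    """Share of element pairs on which the two clusterings agree."""
    la, lb = labels(a), labels(b)
    pairs = list(combinations(sorted(la), 2))
    agree = sum((la[x] == la[y]) == (lb[x] == lb[y]) for x, y in pairs)
    return agree / len(pairs)


def entropy(clustering, n):
    """Shannon entropy (base 2) of the cluster-size distribution."""
    return -sum(len(c) / n * log2(len(c) / n) for c in clustering)


def mutual_information(a, b, n):
    """Mutual information (base 2) between two clusterings of n elements."""
    total = 0.0
    for ca in a:
        for cb in b:
            p_ij = len(ca & cb) / n
            if p_ij > 0:
                total += p_ij * log2(p_ij / ((len(ca) / n) * (len(cb) / n)))
    return total


ri = rand_index(C1, C2)
h1, h2 = entropy(C1, n), entropy(C2, n)
mi = mutual_information(C1, C2, n)
vi = h1 + h2 - 2 * mi

print(ri)                          # 0.4
print(round(h1, 2), round(h2, 2))  # 0.97 0.97
print(round(mi, 2))                # 0.02
print(round(vi, 2))                # 1.9
```

A Rand Index of 1 and a Variation of Information of 0 would both indicate identical clusterings; here the two partitions agree on only 4 of the 10 pairs.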
Figure 13. A topology-based network alignment example. (A) Possible graphlet compositions for 2 and 3 nodes. Orbits 0–3, which are annotated, represent the possible positions for a node in the various graphlets. (B) G1 and G2 graph representations. (C) Graphlet degree signatures. The row names represent the nodes, while the column names represent the different orbits. (D) The final network alignment based on a simplified version of the GRAAL algorithm. Node V1 (in red color) remains unaligned.
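A graphlet degree signature restricted to orbits 0 to 3 can be computed directly from neighbor sets. The Python sketch below assumes the commonly used orbit numbering (0: endpoint of an edge, 1: end of a three-node path, 2: middle of a three-node path, 3: member of a triangle); the figure's own annotation may differ. The example graph is hypothetical and is not G1 or G2 from panel (B).

```python
def orbit_signature(adj, v):
    """Counts of orbits 0-3 touched by node v (adj maps node -> set of nodes)."""
    nbrs = adj[v]
    deg = len(nbrs)
    # Triangles through v: pairs of neighbors that are themselves adjacent.
    triangles = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    orbit0 = deg
    # v at the end of an induced path v-u-w: w is adjacent to u but not to v.
    orbit1 = sum(len(adj[u] - nbrs - {v}) for u in nbrs)
    # v in the middle of an induced path: pairs of non-adjacent neighbors.
    orbit2 = deg * (deg - 1) // 2 - triangles
    orbit3 = triangles
    return (orbit0, orbit1, orbit2, orbit3)


# Hypothetical graph: triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

for node in sorted(adj):
    print(node, orbit_signature(adj, node))
# a (2, 1, 0, 1)
# b (2, 1, 0, 1)
# c (3, 0, 2, 1)
# d (1, 2, 0, 0)
```

Nodes a and b receive identical signatures because they occupy topologically equivalent positions, which is the kind of similarity a signature-based aligner exploits when pairing nodes across networks.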
Figure 14. A triangle closing link prediction example. (A) The adjacency matrix of the undirected, unweighted example network. (B) The algebraic representation of the A² matrix. Each (i, j) value represents the number of common neighbors of the nodes i and j. (C) The example network plot. The red edge represents the new predicted link at time point t + 1.
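The common-neighbor reading of the squared adjacency matrix can be illustrated with a minimal Python sketch. The five-node network below is hypothetical, not the one in the figure; the predicted link is simply the unconnected pair with the most common neighbors, and ties (absent here) would need an explicit rule.

```python
# Hypothetical undirected, unweighted network on nodes 0..4.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
n = 5

A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1


def matmul(X, Y):
    """Plain matrix product of two square matrices given as nested lists."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


# Off-diagonal entry (i, j) of A squared = number of common neighbors of i and j.
A2 = matmul(A, A)

# Score every pair that is not yet connected and take the best one.
candidates = [(A2[i][j], i, j)
              for i in range(n) for j in range(i + 1, n) if A[i][j] == 0]
score, i, j = max(candidates)

print("predicted link:", (i, j), "common neighbors:", score)
# predicted link: (0, 3) common neighbors: 2
```

Adding the edge between nodes 0 and 3 would close two open triangles at once (through nodes 1 and 2), which is why this pair outranks the other unconnected pairs.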