Literature DB >> 35677641

Quantification of network structural dissimilarities based on network embedding.

Zhipeng Wang1, Xiu-Xiu Zhan1, Chuang Liu1, Zi-Ke Zhang2.   

Abstract

Quantifying structural dissimilarities between networks is a fundamental and challenging problem in network science. Previous network comparison methods are based on structural features, such as the shortest path length and the degree, which capture only part of the topological information. We therefore propose an efficient network comparison method based on network embedding, which takes global structural information into account. In detail, we first construct a distance matrix for each network based on the distances between node embedding vectors derived from DeepWalk. Then, we define the dissimilarity between two networks based on the Jensen-Shannon divergence of the distance distributions. Experiments on both synthetic and empirical networks show that our method outperforms the baseline methods and can distinguish networks well. In addition, we show that our method can capture network properties, e.g., average shortest path length and link density. Moreover, an experiment on modularity further demonstrates the usefulness of our method.
© 2022 The Author(s).

Keywords:  Computer science; Network; Network topology

Year:  2022        PMID: 35677641      PMCID: PMC9168171          DOI: 10.1016/j.isci.2022.104446

Source DB:  PubMed          Journal:  iScience        ISSN: 2589-0042


Introduction

A network is a natural representation of complex data associations and has been used in many domains, ranging from biology (Liu et al., 2020) and physics (Boccaletti et al., 2006) to the social sciences (Strogatz, 2001). Because of the specific characteristics of the complex system it represents, a network exhibits non-trivial topological features, such as the scale-free (Barabási and Albert, 1999) and small-world properties (Watts and Strogatz, 1998). The flexibility of network modeling and the rapid growth of network data in recent years make it urgent to design effective network comparison methods. Comparing structural similarities between networks is an important task with various scientific applications, e.g., the comparison of brain networks across subjects (Bullmore and Sporns, 2009) and of diffusion cascades of news (Zhan et al., 2018), the classification of proteins (Liu et al., 2020), the identification of change points in temporal networks (Holme and Saramäki, 2012), and the evaluation of generative network models (Hartle et al., 2020; Ali et al., 2014; De Domenico et al., 2015). Researchers have proposed methods based on graph isomorphism to compare networks (Zemlyachenko et al., 1985; Babai, 2016; Grohe and Schweitzer, 2020). The main limitations of isomorphism-based methods are as follows: first, they can only compare networks of the same size and are not scalable to large networks with millions of nodes; second, they can only tell whether two networks are isomorphic, but hardly measure to what extent two networks differ. Thanks to mature research on network topology mining (Costa et al., 2007; Martínez and Chavez, 2019; Tsitsulin et al., 2018; Gärtner et al., 2003), a number of researchers have studied how to use network characteristics, e.g., the adjacency matrix, node degree, and shortest path distance, to compare networks of large and different sizes.
For instance, Saxena et al. (2019) introduced a network similarity method based on hierarchical diagram decomposition using the Canberra distance, which considers both local and global network topology. Lu et al. (2014) proposed a manifold diffusion method based on random walks, which can distinguish not only networks with different degree distributions but also networks with the same degree distribution. Beyond the direct comparison of network topology, we have witnessed the effectiveness of information-theoretic tools, i.e., information entropy, in network comparison. For example, De Domenico and Biamonte (2016) proposed a set of information theory tools for network comparison based on spectral entropy. Schieber et al. (2017) quantified the dissimilarities between networks by considering the probability distribution of the shortest path distance between nodes. Chen et al. (2018) proposed a comparison method based on the node communicability sequence entropy. Bagrow and Bollt (2019) proposed a method based on portrait divergence, which incorporates the topological characteristics of networks at all scales and is applicable to all types of networks. The basic idea behind these methods is that one specific network property, such as the shortest path distance (Schieber et al., 2017) or the node communicability matrix (Chen et al., 2018), is chosen to measure the information content of a network via a proper entropy. The dissimilarity between two networks is then given by the difference between their information contents. However, we argue that selecting one specific property as a representative of the network information content may fail to capture the information of the whole network. For example, one can quantify network dissimilarities by comparing shortest-path distance distributions via information entropy.
However, the shortest-path distance between nodes is only one of many properties of a network; it cannot represent the complete structure. How to extract network features sufficient to quantify network differences is therefore an urgent open problem. Network embedding, which aims to embed each node into a low-dimensional vector while preserving the network structure as much as possible, has been widely used to solve many problems in network science, e.g., link prediction (Bu et al., 2019; Grover and Leskovec, 2016), community detection (Jin et al., 2019; Li et al., 2016; Fortunato, 2010), and network reconstruction (Pio et al., 2020; Xu et al., 2020; Goyal and Ferrara, 2018). In this paper, we further widen the application of network embedding, i.e., we explore how to use it to characterize the dissimilarity of two networks. We start by using a simple and fast network embedding algorithm, DeepWalk, which can capture the global information of a network, to measure the distance between two nodes of a given network. Then, the information content of a network, i.e., its distance distribution heterogeneity, is defined based on the node distance distributions and the Jensen-Shannon divergence. Accordingly, the dissimilarity between two networks is defined upon the distance distribution heterogeneities of the pair. We validate the effectiveness of the embedding-based comparison method on both synthetic and empirical networks. Compared to the baseline methods, it shows high distinguishability.

Results

Embedding-based network dissimilarity

Given a network G = (V, E), in which V represents the node set and E the edge set, the number of nodes is given by N = |V|, where |·| indicates the cardinality of a set. The adjacency matrix of G is given by A = (a_ij), in which a_ij = 1 if there is a link between nodes v_i and v_j, and a_ij = 0 otherwise. We use DeepWalk to learn the embedding vector of every node (Perozzi et al., 2014). Concretely, DeepWalk conducts uniform random walks to obtain node sequences, which serve as the input for a learning model, i.e., SkipGram. The embedding vectors of the nodes contain the structural information of the original network. For a node v_i, we use x_i to represent the embedding vector obtained from DeepWalk. We then define the Euclidean distance between two arbitrary nodes v_i and v_j as d_ij = ||x_i − x_j||; a smaller d_ij indicates that v_i and v_j are more similar. The Euclidean distance matrix is denoted as D, in which the i-th row d_i contains the Euclidean distances between node v_i and all N nodes. We partition the range of observed distances into L equal-width bins and use H_i = (h_i1, ..., h_iL) to represent the Euclidean distance distribution of node v_i, in which h_ij is the probability that the Euclidean distance between a node and node v_i falls in the j-th bin. L is a tunable parameter. It is worth noting that the distance used here is not limited to the Euclidean distance. We test the robustness of our embedding method by using distance matrices generated by the Manhattan distance and by the inner product between node embedding vectors. The performance of these two distances for network comparison is given in Figures S3–S4, which show that different distance measures do not change the similarity trend of our embedding-based comparison method. We introduce the Jensen-Shannon divergence to define the network dissimilarity based on the Euclidean distance distributions.
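The construction of the distance matrix and the per-node binned distance distributions can be sketched in plain Python; the two-dimensional "embedding" vectors below are made-up stand-ins for actual DeepWalk output, and `n_bins` plays the role of the tunable parameter L (a sketch, not the authors' implementation):

```python
import math

def euclidean(u, v):
    # Euclidean distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(vectors):
    # Pairwise node-to-node distances computed from the embedding vectors
    return [[euclidean(u, v) for v in vectors] for u in vectors]

def node_distance_distribution(row, d_max, n_bins):
    # Bin one node's distances into n_bins equal-width bins on [0, d_max]
    hist = [0.0] * n_bins
    for d in row:
        idx = min(int(d / d_max * n_bins), n_bins - 1) if d_max > 0 else 0
        hist[idx] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

# Toy example with made-up 2-dimensional "embeddings"
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
D = distance_matrix(vectors)
d_max = max(max(row) for row in D)
H = [node_distance_distribution(row, d_max, n_bins=4) for row in D]
```

Each row of `H` is one node's normalized distance histogram; swapping `euclidean` for a Manhattan or inner-product distance changes only the first function, matching the robustness check described above.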
The Euclidean distance distribution heterogeneity of a network, U(G), measures the heterogeneity of a network G in terms of node distances; a network with a high diversity of node distance patterns has a large U(G). It is defined as

U(G) = J(H_1, ..., H_N),    (Equation 1)

where J(H_1, ..., H_N) represents the Jensen-Shannon divergence of the node Euclidean distance distributions. The average Euclidean distance distribution for a network G is given by μ_G = (μ_1, ..., μ_L), in which μ_j (j = 1, ..., L) is the average of the j-th component over all node distributions, i.e., the average probability that a node-to-node Euclidean distance falls in the j-th bin. Given two networks G_1 and G_2, we denote μ_{G1} and μ_{G2} as the average Euclidean distance distributions of G_1 and G_2, respectively. The dissimilarity between G_1 and G_2, D(G_1, G_2), is given by

D(G_1, G_2) = ω sqrt(J(μ_{G1}, μ_{G2}) / log 2) + (1 − ω) |sqrt(U(G_1)) − sqrt(U(G_2))|,    (Equation 2)

where ω is a tunable parameter that controls the weight of global versus local differences when comparing two networks, with 0 ≤ ω ≤ 1. The first term in Equation (2) compares the global dissimilarity between the networks through their average Euclidean distance distributions. The second term compares local differences by evaluating the difference in Euclidean distance distribution heterogeneity between the two networks. A smaller value of D(G_1, G_2) indicates that G_1 and G_2 are more similar. To obtain node embedding vectors from DeepWalk, we need to set the embedding dimension d, the number of walks per node s, the length of each walk l, and the context window size w; we also set the number of bins L of the Euclidean distance distribution. The influence of L and of the DeepWalk parameters d, s, l, and w on the performance of network comparison is given in Figures S5–S16, which show that different settings of these parameters do not change the similarity trend between networks. In Figure 1, we illustrate the comparison process of our method. Figure 1A shows two networks, G_1 and G_2, in which G_1 is connected and G_2 has one isolated node. The detailed calculation process is shown in Figures 1B and 1C, including the calculation of the node embedding vectors, the node Euclidean distance matrix, the node distance distributions, and the dissimilarity between the two networks via Equations (1) and (2). The dissimilarity between G_1 and G_2 computed this way is as high as 0.32.
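A minimal sketch of the heterogeneity and dissimilarity computations described above: the generalized Jensen-Shannon divergence is the entropy of the mixture minus the mean entropy, and the square-root / log 2 normalization of the global term follows the Schieber-style construction this method builds on, so treat the exact form as an assumption rather than the authors' exact code:

```python
import math

def entropy(p):
    # Shannon entropy (natural log) of a discrete distribution
    return -sum(x * math.log(x) for x in p if x > 0)

def js_divergence(dists):
    # Generalized Jensen-Shannon divergence of a list of distributions:
    # entropy of the mixture minus the mean of the individual entropies
    n = len(dists)
    mix = [sum(col) / n for col in zip(*dists)]
    return entropy(mix) - sum(entropy(p) for p in dists) / n

def heterogeneity(H):
    # Distance-distribution heterogeneity of a network (Equation 1 style):
    # JS divergence of all per-node distance distributions
    return js_divergence(H)

def dissimilarity(H1, H2, omega=0.5):
    # Equation (2)-style combination of a global term (average distance
    # distributions) and a local term (heterogeneity difference)
    n1, n2 = len(H1), len(H2)
    mu1 = [sum(col) / n1 for col in zip(*H1)]
    mu2 = [sum(col) / n2 for col in zip(*H2)]
    global_term = math.sqrt(js_divergence([mu1, mu2]) / math.log(2))
    local_term = abs(math.sqrt(heterogeneity(H1)) - math.sqrt(heterogeneity(H2)))
    return omega * global_term + (1 - omega) * local_term
```

Identical distribution sets yield a dissimilarity of zero, and the measure grows as the two networks' distance profiles diverge.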
Figure 1

Illustration of the network embedding-based comparison method

(A) Visualization of two networks G_1 and G_2, each with 12 nodes and 12 edges. It should be noted that our method is applicable to comparing networks with different numbers of nodes and edges.

(B–C) Example of how to compute the embedding-based dissimilarity, including the characterization of the node embeddings, the calculation of the node Euclidean distances, the node distance distributions, the average Euclidean distance distribution, and the dissimilarity between the two networks.


Synthetic network comparison

To verify the ability of our method to quantify network dissimilarity, we perform comparisons on synthetic networks generated by the WS and BA models, using the same network size in all models. In the WS model, we compare networks generated with different rewiring probabilities p, fixing the average degree at 10. Figures 2A–2C show the dissimilarity values obtained by the embedding-based method, the shortest-path-based method, and the communicability-based method between networks generated by the WS model with different p. Generally, we find that for all three methods the dissimilarity values between networks generated with similar p are much smaller than those between networks generated with dramatically different p. The proposed method can detect the network dissimilarity for all values of p (Figure 2A), whereas the two baselines cannot identify the difference between networks for large values of p (Figures 2B and 2C). The three methods are based on the embedding-based distance distribution, the shortest path distance distribution, and the node communicability distribution, respectively. The embedding-based distance distributions remain distinguishable across different p (Figure 2D). However, the distributions of the shortest path distance and of the node communicability become very narrow for large p (Figures 2E and 2F), leaving no difference for the corresponding comparison methods to detect. In addition, the comparison of WS networks over log-spaced values of p is given in Figures S1A–S1C, which reveals the same results as Figures 2A–2C. In the BA model, we generate networks by changing the value of m, the number of edges added per new node at each time step. Figures 2G–2I show the comparison of networks generated by the BA model via the three methods. As in the WS model, the embedding-based method shows the best performance. The reason the two baselines perform worse is apparent from the shortest path distance distributions and the node communicability distributions under varying m, shown in Figures 2K and 2L, respectively.
In addition, we compare the dissimilarities between preferential attachment networks generated with different values of the nonlinear preferential attachment exponent α in Figures S1D–S1F, which again shows that our method outperforms the baselines.
Figure 2

Performance of three comparison methods on synthetic networks

(A–C) Dissimilarity values of the embedding-based, shortest-path-based, and communicability-based methods for networks generated by the WS model, respectively.

(D) The embedding-based average distance distributions of WS networks with different p.

(E) The average shortest path distance distributions of WS networks with different p.

(F) The node communicability distributions of WS networks with different p.

(G–I) Dissimilarity values of the embedding-based, shortest-path-based, and communicability-based methods for networks generated by the BA model under different m, respectively.

(J) The embedding-based average distance distributions of BA networks with different m.

(K) The average shortest path distance distributions of BA networks with different m.

(L) The node communicability distributions of BA networks with different m. All the results are averaged over 100 realizations, where we use ω = 0.5.

We show how the dissimilarity between networks changes with the parameter ω in Figures S2A and S2B. In all the networks, we keep the average degree at 10. Each point in Figure S2A shows the dissimilarity between a WS network of a fixed reference size and WS networks of varying size, at a fixed rewiring probability; different curves correspond to different values of ω. We find that a WS network is most similar to networks generated with a close size, and that varying ω does not affect the similarity trend. However, a larger value of ω results in larger dissimilarity values between networks. In Figure S2B, we give the same analysis for the BA model, which shows results similar to those for the WS model. We also compare the differences between the following networks in Figure S2C: BA, WSL (obtained by rewiring 1% of the edges of a K-regular network), and WSH (obtained by rewiring 10% of the edges of a K-regular network).
Figure S2C shows how the dissimilarity values change as ω increases, and the results show that a larger ω gives larger dissimilarity values. Furthermore, when ω = 0, so that only local structural information is used (Equation (2)), the differences between the three pairs of synthetic networks are not effectively distinguished. On the contrary, when ω is large, the global information of the network better distinguishes the network differences. Therefore, we set ω = 0.5 in the following analysis. To compare the different dissimilarity methods, we also show the dissimilarity between four synthetic networks with the same node size, edge size, and average node degree 10. The four networks are generated by the K-regular, WSL, WSH, and BA models. From the generation models, we know that the similarity between K-regular and the other three networks should decrease in the order WSL, WSH, BA. Figure 3 gives the dissimilarity between the four networks under the three methods, and implies that the dissimilarities obtained by all three comparison methods are consistent with the rules of the network generation models. However, the shortest-path-based dissimilarity values between K-regular and WSL and between K-regular and WSH are almost the same, and the communicability-based dissimilarity values between the four synthetic networks are all very close, indicating that these baseline methods cannot effectively discriminate between these synthetic networks.
Figure 3

Comparison of four synthetic networks (K-regular, WSL, WSH, and BA)

We use three different methods: the embedding-based method, which computes the dissimilarity from the embedding-based distance distributions; the shortest-path-based method, which computes the dissimilarity from the shortest path distance distributions; and the communicability-based method, which computes the dissimilarity from the communicability sequence entropy. We consider networks with average node degree 10. All the results are averaged over 100 realizations with ω = 0.5.


Real networks comparison

We validate the effectiveness of our network embedding-based comparison method on real networks from different domains. Table 1 gives the basic properties of the real networks, including the number of nodes (N), the number of edges (|E|), the average degree (Ad), the average shortest path length (Avl), the link density (Ld), the clustering coefficient (C), and the diameter (dia). The 12 real networks range from a protein-protein interaction network (Yeast) and a metabolic network (Metabolic) in biology, to human contact networks (Infectious, Windsurfers), and to social communication networks (Pgp, Rovira, Petster, Petsterc, and Irvine).
Table 1

Basic properties of real networks

Networks      N       |E|     Ad     Avl   Ld      C      dia
Pgp           10,680  24,316  4.55   7.49  0.0004  0.266  24
Yeast         1,870   2,203   2.44   6.81  0.0013  0.067  19
Contiguous    49      107     4.37   4.16  0.0910  0.497  11
Infectious    410     2,765   13.49  3.63  0.0330  0.456  9
Rovira        1,133   5,451   9.62   3.61  0.0085  0.220  8
Petsterc      2,426   16,631  13.71  3.59  0.0057  0.538  10
Petster       1,858   12,534  13.49  3.45  0.0073  0.141  14
Irvine        1,899   59,835  14.57  3.06  0.0079  0.109  8
Metabolic     453     2,025   8.94   2.68  0.0198  0.646  7
Jazz          198     2,742   27.69  2.24  0.1406  0.617  6
Chesapeake    39      170     8.72   1.83  0.2294  0.450  3
Windsurfers   43      336     15.63  1.69  0.3721  0.653  3

N, |E|, Ad, Avl, Ld, C, and dia represent the number of nodes, the number of edges, the average degree, the average shortest path length, the link density, the average clustering coefficient, and the network diameter, respectively.

First, we show the difference between a real network and its corresponding null models in Figure 4A. For a network G, we consider three kinds of k-order null models, with k = 1.0, 2.0, and 2.5 (Orsini et al., 2015), denoted DK1.0, DK2.0, and DK2.5, respectively. Different values of k indicate the preservation of the network topology to different degrees: DK1.0 retains the degree sequence; for DK2.0, the degree sequence and the degree correlations are invariant during the rewiring process; DK2.5 additionally preserves the clustering spectrum of the original network. The dissimilarity values are averaged over 100 independent runs. With increasing k, the dissimilarity between a real network and its randomized counterparts tends to become smaller across networks (each row in Figure 4A). This pattern is consistent with the randomization process: a larger k means the randomized networks share more properties with the original network and are therefore more similar to it.
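The first-order null model, which preserves only the degree sequence, can be approximated by double-edge swaps; the sketch below covers just this k = 1.0 case (the higher-order dk models of Orsini et al. require considerably more machinery, so this is an illustrative simplification, not the paper's randomization code):

```python
import random
from collections import Counter

def degree_preserving_rewire(edges, n_swaps=100, seed=0):
    # Double-edge swap: replace (a,b),(c,d) with (a,d),(c,b), rejecting
    # swaps that would create self-loops or multi-edges, so every node's
    # degree stays fixed (a dk1.0-style null model sketch)
    rng = random.Random(seed)
    edges = [tuple(sorted(e)) for e in edges]
    edge_set = set(edges)
    swaps = attempts = 0
    while swaps < n_swaps and attempts < 100 * max(n_swaps, 1):
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if e1 in edge_set or e2 in edge_set:
            continue  # swap would create a multi-edge
        edge_set -= {edges[i], edges[j]}
        edge_set |= {e1, e2}
        edges[i], edges[j] = e1, e2
        swaps += 1
    return edges

def degree_sequence(edges):
    # Node -> degree map computed from an edge list
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg
```

Comparing an original network with such rewired instances reproduces the kind of real-vs-null-model comparison shown in Figure 4A for k = 1.0.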
Figure 4

Dissimilarity between real networks

(A) Comparison between real networks and their null models. We consider the models with different k-values (1.0, 2.0, and 2.5).

(B) Dissimilarity between Petsterc network and the networks generated after certain perturbations, where negative value of f corresponds to the random deletion of edges with the given ratios, and vice versa. Each point in the figure is averaged over 100 times. The shaded error area shows the standard deviation of 100 times.

We also compare the real networks with networks after certain perturbations. The perturbation is performed as follows: for a given network, we randomly add (or delete) a certain fraction f of edges and then compare the dissimilarity between the original network and the perturbed network. Positive f represents the addition process and negative f the deletion process. Figure 4B shows the dissimilarity between the Petsterc network and the perturbed networks after random addition or deletion of edges. It implies that the more we perturb the network, the more dissimilar it becomes to the original network. We show similar trends for the other networks in Figure S6. The results indicate that our comparison method can distinguish the differences between a real network and networks generated by such perturbations.
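The edge-perturbation experiment can be sketched on a plain edge list; `f` is the signed perturbation fraction described above (a hypothetical stdlib sketch, not the authors' code):

```python
import itertools
import random

def perturb_edges(nodes, edges, f, seed=0):
    # Randomly delete (f < 0) or add (f > 0) a fraction |f| of the edges
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in edges}
    k = int(abs(f) * len(edge_set))
    if f < 0:
        # Deletion: drop k existing edges uniformly at random
        removed = rng.sample(sorted(edge_set), k)
        edge_set -= set(removed)
    else:
        # Addition: insert k currently absent node pairs at random
        absent = [e for e in itertools.combinations(sorted(nodes), 2)
                  if e not in edge_set]
        added = rng.sample(absent, min(k, len(absent)))
        edge_set |= set(added)
    return edge_set
```

Sweeping f from negative to positive values and computing the dissimilarity to the original edge set reproduces the shape of the experiment in Figure 4B.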

Analysis on the hybrid method

Figure 2 shows that the embedding-based method is an effective way to distinguish networks, while the shortest-path-based method can only partly tell the difference between synthetic networks. We therefore hybridize these two distance distributions to explore the performance of a hybrid method for network comparison. To recap, for each node v_i we use p_i to represent the shortest path distance distribution and H_i the distance distribution based on network embedding. As the dimensions of p_i and H_i differ, we zero-pad the shorter of the two vectors so that both have the same length. For each node v_i, the hybrid distance distribution M_i is defined as the normalization of the weighted combination

M_i = λ p_i + (1 − λ) H_i,    (Equation 3)

where λ is a tunable parameter. We use M to replace H in Equation (2) and thereby obtain the hybrid network comparison method. We test the performance of the hybrid method on the comparison of a network and its null models (DK1.0, DK2.0, and DK2.5) in Figure 5. The pattern of the dissimilarity between a real network and its null models is consistent with the order of the null models when λ < 1. However, at λ = 1 the hybrid method cannot tell the difference between a network and its null models very well, i.e., the curves for the null models share the same value in Figures 5H, 5I, and 5K. In fact, λ = 1 means the hybrid distance distribution degrades into considering only the shortest path distance distribution (Equation (3)). The basic network features show that the average shortest path lengths of Irvine (Figure 5H), Metabolic (Figure 5I), Chesapeake (Figure 5K), and Windsurfers (Figure 5L) are significantly smaller than those of the other networks, and these networks cannot be well compared when λ = 1.
This indicates that the shortest-path-based distribution cannot tell apart real networks with a small average shortest path length, which is consistent with the findings on the synthetic networks (Figures 2B and 2E). For λ = 0, where the hybrid method degrades into the embedding-based method, we observe better discriminative performance across networks with different average shortest path lengths, which implies the robustness of the embedding-based method to different network structures.
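The zero-padding and λ-mixing step of the hybrid distribution can be sketched as follows (a minimal illustration, assuming both inputs are already normalized histograms):

```python
def hybrid_distribution(p, h, lam):
    # Zero-pad the shorter distribution so both have the same length,
    # then mix: M = lam * p + (1 - lam) * h, renormalized (Equation 3 style)
    n = max(len(p), len(h))
    p = list(p) + [0.0] * (n - len(p))
    h = list(h) + [0.0] * (n - len(h))
    m = [lam * a + (1 - lam) * b for a, b in zip(p, h)]
    total = sum(m)
    return [x / total for x in m] if total > 0 else m
```

At λ = 1 the result reduces to the (padded) shortest-path distribution, and at λ = 0 to the embedding-based one, mirroring the two limiting cases discussed above.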
Figure 5

The dissimilarity between a network and its null models characterized by the hybrid method

When parameter λ = 0, the hybrid method degenerates to the embedding-based method; when λ = 1, only the shortest path distance distribution of the network is used to characterize the dissimilarity. The red, blue, and green lines in each panel describe the dissimilarities between the real network and the DK1.0, DK2.0, and DK2.5 null models, respectively, as the parameter λ changes.


Comparison between real networks

The dissimilarities between the 12 real networks are given in Figure 6. We show the embedding-based dissimilarity between network pairs in Figure 6A; we find that networks with similar average shortest path lengths tend to be similar, which implies that our method takes the path properties of a network into account when comparing networks. This implication is further supported by the high Pearson correlation between the embedding-based and shortest-path-based dissimilarities given in Figure 6B, where both are computed between the 12 real networks. Given two networks G_1 and G_2, we define the average shortest path length difference and the link density difference between them as ΔAvl = |Avl_1 − Avl_2| and ΔLd = |Ld_1 − Ld_2|, respectively. In Figure 6C, we show the Pearson correlation between the dissimilarity values and ΔAvl, which is high. It further explains the result of Figure 6A, i.e., networks with similar average shortest path lengths tend to be similar. Meanwhile, a high Pearson correlation (Figure 6D) is also found between the dissimilarity values and ΔLd. In conclusion, the network embedding-based comparison method can capture network properties such as the average shortest path length and the link density.
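The reported correlations can be computed with a plain Pearson coefficient; the value lists below are made-up stand-ins for the actual vectors of pairwise dissimilarities and property differences:

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length value lists
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: dissimilarity values vs. property differences
dissimilarities = [0.1, 0.3, 0.5, 0.7]
avl_differences = [0.2, 0.5, 1.1, 1.6]
r = pearson(dissimilarities, avl_differences)
```

A value of r near 1 corresponds to the strong positive relationship between dissimilarity and average shortest path length difference shown in Figure 6C.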
Figure 6

Correlation analysis in real networks

(A) Comparison between real networks, in which networks are sorted in descending order based on average shortest path length.

(B) Correlation between network comparison methods and .

(C) Correlation between network dissimilarities and average shortest path length differences on 12 real networks.

(D) Correlation between network dissimilarities and link density differences on 12 real networks. The r value gives the Pearson correlation coefficient, the p value its statistical significance, and the shaded area the confidence interval.

Modularity reflects the strength of the division of a network into communities (Newman, 2006): a network with high modularity has nodes that are densely connected within communities and sparsely connected across communities. We thus explore the relationship between modularity and network structural difference. We define Q as the maximal network modularity over community partitions, which corresponds to the optimal division of a network (Newman and Girvan, 2004). Given two networks G_1 and G_2, we define the modularity difference between them as ΔQ = |Q_1 − Q_2|. Figure 7 shows the correlation between ΔQ and the dissimilarity values on the 12 real networks. The result shows that similar networks tend to have a small ΔQ and vice versa, which further underlines the good performance of our network embedding-based comparison method.
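Modularity for a given partition can be computed directly from the standard Newman-Girvan definition, Q = Σ_c (e_c/m − (d_c/2m)²), where e_c is the number of intra-community edges and d_c the total degree of community c; the example partition below is illustrative (finding the maximal Q requires a community detection algorithm, not shown here):

```python
from collections import Counter

def modularity(edges, community):
    # Newman-Girvan modularity of a given node -> community assignment
    m = len(edges)
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    # Fraction of edges falling inside communities
    intra = sum(1 for a, b in edges if community[a] == community[b])
    # Total degree per community
    comm_deg = Counter()
    for node, d in deg.items():
        comm_deg[community[node]] += d
    return intra / m - sum((d / (2 * m)) ** 2 for d in comm_deg.values())

# Two disconnected triangles with the natural two-community split
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q = modularity(edges, community)
```

For this toy graph the natural split gives Q = 0.5, while lumping all nodes into one community gives Q = 0, which is the kind of Q difference the ΔQ analysis compares across networks.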
Figure 7

Correlation between network dissimilarities and modularity differences on 12 real networks


Discussion

In this paper, we propose a network embedding-based comparison method built on node distance distributions and the Jensen-Shannon divergence. Specifically, we first obtain the embedding vector of each node through DeepWalk and calculate the Euclidean distance between each pair of nodes. We measure the distance distribution heterogeneity of a network via the Jensen-Shannon divergence of the node distance distributions. The dissimilarity between two networks is then defined by combining the difference between the average distance distributions of the networks with the difference in their Euclidean distance distribution heterogeneities. We compare the proposed method with two state-of-the-art methods, i.e., network dissimilarity based on the shortest path distance distribution and network dissimilarity based on the communicability sequence, on various synthetic and real networks, and find that our method shows better performance in quantifying network differences on almost all of them. In addition, we find that the dissimilarity values are strongly linearly correlated with the differences in average shortest path length and link density, and thus the method can capture these network properties. Moreover, real networks that are similar to each other tend to have a small difference in modularity. We confined ourselves to DeepWalk to embed networks, which is a simple and efficient network embedding method. According to previous work, more advanced embedding methods, such as Node2Vec (Grover and Leskovec, 2016) and graph neural networks (Zhang and Chen, 2018), can better capture network topology, yielding better performance in tasks such as link prediction, clustering, and classification. Therefore, these embedding methods could be promising for quantifying network dissimilarity.
We deem that our methods can also be generalized to other network types, such as multilayer networks (Kivelä et al., 2014), temporal networks (Holme, 2015), signed networks (Wang et al., 2017), and hypergraphs (Feng et al., 2019).

Limitations of the study

The distance distribution used in our comparison method is based on a random walk embedding algorithm, i.e., DeepWalk, which is a black box model. Therefore, it is hard to theoretically deduce the specific properties that can be captured by the comparison method.

STAR★Methods

Key resources table

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contacts, Xiu-Xiu Zhan (zhanxiuxiu@hznu.edu.cn), or Zi-Ke Zhang (zkz@zju.edu.cn).

Materials availability

This study did not generate new unique reagents.

Experimental model and subject details

This work does not use experimental models typical in the life sciences.

Method details

Datasets

We consider 12 kinds of real networks; the description of each network is as follows: Pgp is an interaction network of users of the Pretty Good Privacy (PGP) algorithm, restricted to its giant connected component. Yeast and Metabolic are biological networks, in which Yeast is a protein interaction network and Metabolic is a metabolic network of Caenorhabditis elegans. Contiguous is a regional border network of the United States, excluding the isolated states Alaska and Hawaii. Rovira is an e-mail communication network at the University Rovira i Virgili in Tarragona, in the south of Catalonia, Spain. Petsterc and Petster contain friendship and family links between users of the website, in which Petster is the giant connected component. Irvine is a messaging network between the users of an online community of students from the University of California, Irvine. Jazz is a collaboration network between jazz musicians. Chesapeake is a mesohaline trophic network of Chesapeake Bay, an estuary in the United States of America. Windsurfers contains interpersonal contacts between windsurfers in southern California during the fall of 1986.

Baselines

Network dissimilarity based on shortest path distance distribution

Suppose the shortest path distance distribution of node i is denoted by P_i = {p_i(1), ..., p_i(d)}, in which p_i(j) is defined as the fraction of nodes at distance j from node i. Network node dispersion (NND) measures the network connectivity heterogeneity in terms of shortest path distance and is defined by the following equation:

NND(G) = J(P_1, ..., P_N) / log(d + 1),

where d represents the network diameter and J(P_1, ..., P_N) is the Jensen-Shannon divergence of the node distance distributions. The dissimilarity measure is based on three distance-based probability distribution function vectors and is defined as follows:

D(G, G') = w_1 * sqrt(J(mu_G, mu_G') / log 2) + w_2 * |sqrt(NND(G)) - sqrt(NND(G'))| + (w_3 / 2) * [sqrt(J(P_{alpha,G}, P_{alpha,G'}) / log 2) + sqrt(J(P_{alpha,G^c}, P_{alpha,G'^c}) / log 2)],

where w_1, w_2, w_3, and alpha are tunable parameters, in which w_1 + w_2 + w_3 = 1. The first term indicates the dissimilarity characterized by the averaged shortest path distance distributions, i.e., mu_G and mu_G'. The second term characterizes the difference of network node dispersion. The last term is the difference of the alpha-centrality distributions, in which G^c is the complementary graph of G. We set w_1 = w_2 = 0.45 and w_3 = 0.1, which are the default settings used in (Schieber et al., 2017).
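The node distance distributions and NND can be computed with plain breadth-first search. The following is a stdlib-only sketch under the definitions above, assuming a connected, unweighted graph given as adjacency lists; it is not the reference implementation from Schieber et al. (2017), and it omits the alpha-centrality term.

```python
from collections import deque
import math

def bfs_distances(adj, src):
    """Hop distances from src in an unweighted graph given as adjacency lists."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def network_node_dispersion(adj):
    """Sketch of NND(G) = J(P_1, ..., P_N) / log(d + 1) for a connected graph."""
    n = len(adj)
    all_dist = [bfs_distances(adj, u) for u in adj]
    diam = max(max(d.values()) for d in all_dist)
    # p_i(j): fraction of nodes at distance j (j = 1..diam) from node i
    P = []
    for d in all_dist:
        counts = [0] * diam
        for dv in d.values():
            if dv > 0:
                counts[dv - 1] += 1
        P.append([c / (n - 1) for c in counts])
    mu = [sum(p[j] for p in P) / n for j in range(diam)]  # averaged distribution

    def entropy(p):
        return -sum(x * math.log(x) for x in p if x > 0)

    # generalized Jensen-Shannon divergence: H(mean) - mean of H
    J = entropy(mu) - sum(entropy(p) for p in P) / n
    return J / math.log(diam + 1)
```

On a complete graph every node sees the same distance distribution, so NND is 0; any heterogeneity in the distributions (e.g., a path graph) makes it positive.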

Network dissimilarity based on communicability sequence

The communicability matrix C measures the communicability between nodes and is defined as follows:

C = e^A = sum_{k=0}^{infinity} A^k / k!,

where C_{uv} unveils the communicability between node u and node v. Let P = {p_1, p_2, ..., p_M} be the normalized communicability sequence, in which p_i = c_i / sum_{j=1}^{M} c_j, c_i is an entry of the upper triangular part of C (including the diagonal), and M = N(N + 1)/2. The Shannon entropy of the sequence is expressed as follows: S(P) = -sum_{i=1}^{M} p_i log p_i. Given two networks G_1 and G_2, normalized communicability sequences are given by P and Q, respectively. We sort the values in P (Q) in an ascending order and obtain new communicability sequences P' (Q'). Therefore, the communicability-based dissimilarity is defined as the square root of the normalized Jensen-Shannon divergence of the sorted sequences:

D_C(G_1, G_2) = sqrt{ [S((P' + Q')/2) - (S(P') + S(Q'))/2] / log 2 }.
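A numerical sketch of this baseline follows, under the definitions above. Assumptions: both adjacency matrices are symmetric and of equal size, e^A is computed by eigendecomposition rather than the power series, and the JS divergence is taken between the sorted, normalized upper-triangular sequences.

```python
import numpy as np

def communicability_matrix(A):
    """C = e^A for a symmetric adjacency matrix, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T

def communicability_sequence(A):
    """Sorted, normalized upper-triangular entries (incl. diagonal) of C."""
    C = communicability_matrix(A)
    seq = np.sort(C[np.triu_indices(len(A))])  # M = N(N+1)/2 values
    return seq / seq.sum()

def communicability_dissimilarity(A1, A2):
    """Square root of the normalized JS divergence of the two sequences."""
    p, q = communicability_sequence(A1), communicability_sequence(A2)

    def H(x):
        x = x[x > 0]
        return -np.sum(x * np.log(x))

    m = 0.5 * (p + q)
    js = H(m) - 0.5 * (H(p) + H(q))
    return np.sqrt(max(js, 0.0) / np.log(2))
```

As with the other measures, identical networks give 0 and the value is bounded by 1, since the JS divergence of two distributions is at most log 2.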

Quantification and statistical analysis

We report the average dissimilarity between each pair of networks over 100 runs, together with the standard deviation of the dissimilarities between the original real networks and the networks generated after certain perturbations. The confidence interval, Pearson correlation coefficient, and p value in Figure 6 are calculated with Origin.

Additional resources

This work does not include any additional resources.
REAGENT or RESOURCE | SOURCE | IDENTIFIER

Deposited data

Network data | This paper | https://doi.org/10.5281/zenodo.6526610
Code | This paper | https://doi.org/10.5281/zenodo.6526610

Software and algorithms

Python version 3.6 | Python Software Foundation | https://www.python.org
OriginPro 9.1 | Data Analysis and Graphing Software | https://www.originlab.com/
References (14 in total)

1.  Emergence of scaling in random networks

Authors:  A-L Barabási; R Albert
Journal:  Science       Date:  1999-10-15       Impact factor: 47.728

2.  Exploring complex networks.

Authors:  S H Strogatz
Journal:  Nature       Date:  2001-03-08       Impact factor: 49.962

3.  Modularity and community structure in networks.

Authors:  M E J Newman
Journal:  Proc Natl Acad Sci U S A       Date:  2006-05-24       Impact factor: 11.205

4.  Collective dynamics of 'small-world' networks.

Authors:  D J Watts; S H Strogatz
Journal:  Nature       Date:  1998-06-04       Impact factor: 49.962

5.  Structural reducibility of multilayer networks.

Authors:  Manlio De Domenico; Vincenzo Nicosia; Alexandre Arenas; Vito Latora
Journal:  Nat Commun       Date:  2015-04-23       Impact factor: 14.919

6.  node2vec: Scalable Feature Learning for Networks.

Authors:  Aditya Grover; Jure Leskovec
Journal:  KDD       Date:  2016-08

7.  Quantifying randomness in real networks.

Authors:  Chiara Orsini; Marija M Dankulov; Pol Colomer-de-Simón; Almerima Jamakovic; Priya Mahadevan; Amin Vahdat; Kevin E Bassler; Zoltán Toroczkai; Marián Boguñá; Guido Caldarelli; Santo Fortunato; Dmitri Krioukov
Journal:  Nat Commun       Date:  2015-10-20       Impact factor: 14.919

8.  Quantification of network structural dissimilarities.

Authors:  Tiago A Schieber; Laura Carpi; Albert Díaz-Guilera; Panos M Pardalos; Cristina Masoller; Martín G Ravetti
Journal:  Nat Commun       Date:  2017-01-09       Impact factor: 14.919

9.  Network comparison and the within-ensemble graph distance.

Authors:  Harrison Hartle; Brennan Klein; Stefan McCabe; Alexander Daniels; Guillaume St-Onge; Charles Murphy; Laurent Hébert-Dufresne
Journal:  Proc Math Phys Eng Sci       Date:  2020-11-04       Impact factor: 2.704

10.  Alignment-free protein interaction network comparison.

Authors:  Waqar Ali; Tiago Rito; Gesine Reinert; Fengzhu Sun; Charlotte M Deane
Journal:  Bioinformatics       Date:  2014-09-01       Impact factor: 6.937

