Identifying influential spreaders by gravity model.

Zhe Li, Tao Ren, Xiaoqi Ma, Simiao Liu, Yixin Zhang, Tao Zhou.

Abstract

Identifying influential spreaders in complex networks is crucial in understanding, controlling and accelerating spreading processes for diseases, information, innovations, behaviors, and so on. Inspired by the gravity law, we propose a gravity model that utilizes both neighborhood information and path information to measure a node's importance in spreading dynamics. In order to reduce the accumulated errors caused by interactions at distance and to lower the computational complexity, a local version of the gravity model is further proposed by introducing a truncation radius. Empirical analyses of the Susceptible-Infected-Recovered (SIR) spreading dynamics on fourteen real networks show that the gravity model and the local gravity model perform very competitively in comparison with well-known state-of-the-art methods. For the local gravity model, the empirical results suggest an approximately linear relation between the optimal truncation radius and the average distance of the network.

Year:  2019        PMID: 31182773      PMCID: PMC6557850          DOI: 10.1038/s41598-019-44930-9

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

Network science plays an increasingly significant role in many domains, including physics, sociology, engineering, biology, and management[1]. The heterogeneous nature of real networks[2] raises a crucial question: how can a node’s importance in a dynamical process be measured quantitatively? Taking spreading dynamics as an example, a popular star on Twitter may remarkably accelerate a rumor, and a few superspreaders could largely expand the epidemic prevalence of a disease[3]. Therefore, a good answer to the above question, namely an efficient algorithm to identify influential spreaders in complex networks, can help to better control the outbreak of an epidemic[4], optimize the use of limited resources to facilitate the dissemination of information[5], prevent catastrophic disruptions of power grids or the Internet[6], discover candidate drug targets and essential proteins[7], and so on. So far, most known methods make use only of structural information[8] and can be roughly classified into neighborhood-based centralities and path-based centralities. Typical representatives of the neighborhood-based centralities are degree centrality[9] (DC), H-index[10] and the k-shell decomposition method[11] (KS). For DC, nodes with larger degrees are more influential. For H-index, nodes connected to many large-degree neighbors are more influential. KS assigns a k-shell index to each node based on its topological location: nodes closer to the core of the network get higher k-shell indices, nodes in the periphery get lower ones, and nodes with higher k-shell indices are considered more influential. Besides, PageRank[12] and LeaderRank[13] are two representative neighborhood-based iterative methods, both suggesting that the influence of a node is determined by the influences of its neighbors. Two well-studied path-based centralities are closeness centrality[14] (CC) and betweenness centrality[15] (BC). CC holds that a node with a smaller average distance to other nodes is more influential, while BC assumes that a node lying on many shortest paths is highly influential.

Inspired by the gravity law, Ma et al.[16] recently proposed two gravity-law-based algorithms that consider both neighborhood information and path information (see Methods for the details of the algorithms). Analogously, we propose a variant algorithm named the gravity model (GM), which also takes into account both neighborhood information and path information: a node with a larger degree (neighborhood information) and shorter average distance to other nodes (path information) is more influential. Furthermore, we propose a local version of the gravity model (named the local gravity model, LGM for short) to lower the computational complexity and reduce the possible noise caused by interactions at a distance. This local model accounts only for pairwise interactions within a truncation radius. Empirical results show that GM and LGM perform very competitively in comparison with well-known state-of-the-art methods. In particular, for LGM, an empirically linear relation between the optimal truncation radius and the average distance of the network is observed.

Results

Algorithms

Individually, nodes with large degrees are likely to be more influential. In addition, a node has a higher impact on nearby nodes[17]. Based on these two observations and inspired by the gravity law, we regard the degree of a node as its mass and the shortest distance between two nodes as their distance. Hence a node i’s influence can be estimated as

$$G(i) = \sum_{j \neq i} \frac{k_i k_j}{d_{ij}^2}, \tag{1}$$

where $k_i$ is the degree of node i, $d_{ij}$ is the shortest distance between node i and node j, and j runs over all nodes other than i. According to Eq. 1, a node that has many neighbors and is close to most other nodes is more influential. This method is named the gravity model (GM) because it adopts the formula of the gravity law.

Although GM can identify nodes that have larger degrees and are on average closer to other nodes, it has two shortcomings. Firstly, calculating the shortest distances between all node pairs is time-consuming for large-scale networks[18]. Secondly, in real propagation a node can hardly affect distant nodes, and estimates of the interacting strength between distant nodes are usually inaccurate because the step-by-step decaying influence is disturbed by accumulated noise[19]. Therefore, by introducing a truncation radius, we consider only the pairwise interactions within that radius. Hence a node i’s influence can be estimated as

$$G_R(i) = \sum_{j \neq i,\; d_{ij} \leq R} \frac{k_i k_j}{d_{ij}^2}, \tag{2}$$

where R is the truncation radius. This method (Eq. 2) is named the local gravity model (LGM) because it takes into account only local information of the network.
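The two scores described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the adjacency-dictionary representation and the function names `bfs_distances` and `gravity_scores` are assumptions made here, and nodes unreachable from i are simply treated as contributing nothing.

```python
from collections import deque


def bfs_distances(adj, source, max_depth=None):
    """Shortest-path lengths from `source`, optionally truncated at `max_depth` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if max_depth is not None and dist[u] >= max_depth:
            continue  # do not expand beyond the truncation radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def gravity_scores(adj, radius=None):
    """Gravity model (radius=None) or local gravity model (integer radius).

    Unreachable nodes never appear in `dist`, so they contribute zero
    (an assumption of this sketch).
    """
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    scores = {}
    for i in adj:
        dist = bfs_distances(adj, i, max_depth=radius)
        scores[i] = sum(
            degree[i] * degree[j] / d ** 2 for j, d in dist.items() if j != i
        )
    return scores


# Toy example: the path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
gm = gravity_scores(path)             # all pairs
lgm = gravity_scores(path, radius=1)  # nearest neighbors only
```

With radius R, each breadth-first search stops after R hops, which is where the saving over the all-pairs computation comes from.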

Data description

In this paper, fourteen real networks from disparate fields are used to test the performance of GM and LGM, including three collaboration networks (Jazz, NS and GrQc), four communication networks (EEC, Email, PG and Enron), four social networks (PB, Facebook, WV and Sex), one transportation network (USAir), one infrastructure network (Power) and one technological network (Router). Jazz[20] is a collaboration network of jazz musicians. NS[21] is a co-authorship network of scientists working on network science. GrQc[22] is a collaboration network of eprint articles in arXiv categories General Relativity and Quantum Cosmology. EEC[23] describes email interchanges between institution members of a large European research institution. Email[24] describes email interchanges between users including faculty, researchers, technicians, managers, administrators, and graduate students of the Rovira i Virgili University. PG[22] is a snapshot of the Gnutella peer-to-peer file sharing network from August 2002. Enron[25] is the Enron email network. PB[26] is a network of US political blogs. Facebook[27] describes social circles from Facebook. WV[28] is a network of Wikipedia who-votes-on-whom. Sex[29] is a bipartite network in which nodes are females (sex sellers) and males (sex buyers) and links between them are established when males write posts indicating sexual encounters with females. USAir[30] is the US air transportation network. Power[31] is the power grid of the western United States. Router[32] is a symmetrized snapshot of the structure of the Internet at the level of autonomous systems. These networks’ topological features (including the number of nodes, the number of links, the average degree, the average distance, the clustering coefficient[31], the assortative coefficient[33], the degree heterogeneity[34] and the epidemic threshold[35] of the SIR model[36]) are shown in Table 1.
Table 1

The basic topological features of the fourteen real networks.

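| Networks | N | E | 〈k〉 | 〈d〉 | C | r | H | β_c |
|---|---|---|---|---|---|---|---|---|
| Jazz | 198 | 2472 | 27.6970 | 2.2350 | 0.6334 | 0.0202 | 1.3951 | 0.0266 |
| NS | 379 | 914 | 4.8232 | 6.0419 | 0.7981 | −0.0817 | 1.6630 | 0.1424 |
| GrQc | 4158 | 13422 | 6.4560 | 6.0494 | 0.6648 | 0.6392 | 2.7852 | 0.0589 |
| EEC | 986 | 16064 | 32.5842 | 2.5869 | 0.4505 | −0.0257 | 2.2912 | 0.0136 |
| Email | 1133 | 5451 | 9.6222 | 3.6060 | 0.2540 | 0.0782 | 1.9421 | 0.0565 |
| PG | 6299 | 20776 | 6.5966 | 4.6430 | 0.0150 | 0.0355 | 2.6764 | 0.0600 |
| Enron | 33696 | 180811 | 10.7319 | 4.0252 | 0.7081 | −0.1165 | 13.2655 | 0.0071 |
| PB | 1222 | 16714 | 27.3552 | 2.7375 | 0.3600 | −0.2213 | 2.9707 | 0.0125 |
| Facebook | 4039 | 88234 | 43.6910 | 3.6925 | 0.6170 | 0.0636 | 2.4392 | 0.0095 |
| WV | 7066 | 100736 | 28.5129 | 3.2475 | 0.2090 | −0.0833 | 5.0992 | 0.0069 |
| Sex | 15810 | 38540 | 4.8754 | 5.7846 | 0 | −0.1145 | 5.8276 | 0.0365 |
| USAir | 332 | 2126 | 12.8072 | 2.7381 | 0.7494 | −0.2079 | 3.4639 | 0.0231 |
| Power | 4941 | 6594 | 2.6691 | 18.9892 | 0.1065 | 0.0035 | 1.4504 | 0.3483 |
| Router | 5022 | 6258 | 2.4922 | 6.4488 | 0.0329 | −0.1384 | 5.5031 | 0.0786 |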

N and E are the number of nodes and links. 〈k〉 and 〈d〉 are the average degree and the average distance. C and r are the clustering coefficient and the assortative coefficient. H is the degree heterogeneity. β_c is the epidemic threshold of the SIR model.


Empirical results

We apply the well-known SIR model[36] to compare the rankings of influences produced by the algorithms with those produced by simulations. Initially, one node (called the seed) is in the infected state (I) and all others are in the susceptible state (S). At each step, every infected node infects each of its susceptible neighbors with probability β and then recovers with probability λ; recovered nodes never participate in the dynamics again. The spreading process repeats until there are no more infected nodes in the network. The influence of any node i can be estimated by

$$R_i = \frac{N_R}{N}, \tag{3}$$

where $N_R$ is the number of recovered nodes at the end of the dynamics seeded at node i and N is the number of nodes. For simplicity, we set λ = 1, and the corresponding epidemic threshold[34] is

$$\beta_c \approx \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle}, \tag{4}$$

where 〈k〉 and 〈k²〉 denote the average degree and the second-order moment of the degree distribution. Given a network and the transmission probability β, the standard ranking of nodes’ influences is obtained from 1000 independent runs, in each of which every node is selected once as the seed. The accuracy of an algorithm is measured by the Kendall’s Tau (τ)[37] between the standard ranking and the ranking produced by the algorithm (see details in Methods). A larger value of τ means a stronger correlation between the two sequences and thus a better performance. Table 2 compares the accuracies of the two proposed algorithms (i.e., GM and LGM) and seven benchmark algorithms (see details about the benchmark algorithms in Methods). The transmission probability for each network is fixed at $\beta = \beta_c$ (for more values of β, see Fig. 1), and the parameters of the relevant algorithms are all adjusted to the values that give the largest τ.
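The simulation protocol above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names, the adjacency-dictionary input and the fixed random seed are choices made here, the recovery probability is hard-wired to λ = 1, and the influence is reported as the average final fraction of recovered nodes.

```python
import random


def epidemic_threshold(adj):
    """Threshold <k> / (<k^2> - <k>) for the SIR model with lambda = 1."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    k1 = sum(degrees) / len(degrees)
    k2 = sum(k * k for k in degrees) / len(degrees)
    return k1 / (k2 - k1)


def sir_outbreak_size(adj, seed, beta, rng):
    """One SIR run with lambda = 1; returns the final number of recovered nodes."""
    infected = {seed}
    recovered = set()
    while infected:
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                susceptible = (
                    v not in infected
                    and v not in recovered
                    and v not in newly_infected
                )
                if susceptible and rng.random() < beta:
                    newly_infected.add(v)
        recovered |= infected      # lambda = 1: every infected node recovers
        infected = newly_infected
    return len(recovered)


def simulated_influence(adj, beta, runs=1000, seed=0):
    """Average final recovered fraction when each node is the single seed."""
    rng = random.Random(seed)
    n = len(adj)
    totals = {i: 0 for i in adj}
    for _ in range(runs):
        for i in adj:  # every node is the seed once per run
            totals[i] += sir_outbreak_size(adj, i, beta, rng)
    return {i: totals[i] / (runs * n) for i in adj}


# Toy example: a star with one hub (node 0) and four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
beta_c = epidemic_threshold(star)
influence = simulated_influence(star, beta_c, runs=200)
```

Sorting the nodes by `influence` gives the simulated ranking against which an algorithm's ranking is compared.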
Table 2

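The algorithms’ accuracies for $\beta = \beta_c$, measured by the Kendall’s Tau (τ).

| Networks | BC | CC | DC | H-index | KS | G | G+ | GM | LGM |
|---|---|---|---|---|---|---|---|---|---|
| Jazz | 0.4590 | 0.7043 | 0.8088 | 0.8417 | 0.7608 | 0.8677 | **0.9025** | 0.8533 | 0.8634 |
| NS | 0.2979 | 0.3415 | 0.5728 | 0.5561 | 0.5051 | 0.8110 | **0.8464** | 0.7611 | 0.8231 |
| GrQc | 0.3231 | 0.5464 | 0.6443 | 0.6362 | 0.6115 | 0.8337 | 0.7922 | 0.7684 | **0.8417** |
| EEC | 0.7151 | 0.8610 | 0.8468 | 0.8641 | 0.8525 | 0.8943 | **0.9189** | 0.8803 | 0.9022 |
| Email | 0.6254 | 0.8104 | 0.7665 | 0.7887 | 0.7707 | 0.8720 | **0.9076** | 0.8265 | 0.8671 |
| PG | 0.5605 | 0.6916 | 0.5941 | 0.6216 | 0.5897 | 0.6992 | **0.7082** | 0.6632 | 0.6900 |
| Enron | 0.3387 | 0.4241 | 0.4657 | 0.4654 | 0.4636 | 0.4859 | 0.4610 | 0.5055 | **0.5075** |
| PB | 0.6839 | 0.7865 | 0.8580 | 0.8732 | 0.8633 | 0.9001 | **0.9211** | 0.8887 | 0.9067 |
| Facebook | 0.4450 | 0.3362 | 0.6704 | 0.6948 | 0.6965 | 0.7117 | 0.7361 | 0.7160 | **0.7394** |
| WV | 0.6305 | 0.6748 | 0.6763 | 0.6788 | 0.6778 | 0.6919 | 0.6917 | 0.6895 | **0.6926** |
| Sex | 0.4251 | 0.6119 | 0.4774 | 0.4889 | 0.4934 | 0.6606 | 0.6386 | 0.6092 | **0.6713** |
| USAir | 0.5181 | 0.8052 | 0.7320 | 0.7525 | 0.7470 | 0.8514 | **0.9012** | 0.8286 | 0.8817 |
| Power | 0.3205 | 0.3653 | 0.4207 | 0.3935 | 0.3084 | 0.6610 | **0.7544** | 0.6128 | 0.6947 |
| Router | 0.3059 | 0.5120 | 0.3107 | 0.1917 | 0.1791 | 0.6216 | 0.6226 | 0.5782 | **0.6441** |

The best-performing algorithm for each network is emphasized in bold.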

Figure 1

The algorithms’ accuracies for different β, measured by the Kendall’s Tau (τ).

As shown in Table 2, both GM and LGM are very competitive. In particular, G+ and LGM perform best among the nine algorithms. Notice that G+ also adopts the gravity formula[16] (see Methods), but a node’s mass in G+ is defined as its k-shell index, so G+ is a global index. The results reported in Table 2 demonstrate the advantage of gravity models (e.g., G, G+, GM, LGM) and show that a local index (LGM) can outperform most benchmark algorithms, including some global indices. As shown in Fig. 1, results for other values of β not too far from the threshold are consistent with those at $\beta_c$, suggesting the robustness of our findings. Since determining the optimal truncation radius, denoted by R*, requires additional computation, we examine whether topological information can be used to estimate R*. As shown in Fig. 2, R* scales approximately linearly with the average distance, as

$$R^* \approx 0.5 \langle d \rangle \tag{5}$$

at $\beta = \beta_c$. This approximately linear relation also holds for other values of β not too far from $\beta_c$. The empirical relation can save computational cost in practice.
Figure 2

The relation between R* and 〈d〉 for $\beta = \beta_c$. Fourteen pentagrams represent the fourteen networks, and the slope of the blue line is 1/2. The pentagram in black is the outlier, the Enron network. Although its optimal truncation radius R* = 7 differs considerably from what Eq. 5 predicts (i.e., R = 2), the algorithmic accuracy at R = 2 (τ = 0.4949) is very close to the best accuracy at R* = 7 (τ = 0.5075) when compared with the traditional methods (e.g., about 0.34 for BC, 0.42 for CC and 0.46 for DC, KS and H-index). That is to say, applying Eq. 5 still achieves much better algorithmic performance than the traditional methods.

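As a rough illustration of how the empirical relation might be used in practice, the sketch below computes the average distance by breadth-first search and turns it into an integer radius. The function names are hypothetical, and rounding to the nearest integer with a minimum of 1 is an assumption of this sketch; the text does not state a rounding rule.

```python
from collections import deque


def average_distance(adj):
    """Mean shortest-path length over all ordered pairs of mutually reachable nodes.

    Assumes the graph contains at least one connected pair.
    """
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def heuristic_radius(avg_distance):
    """Half the average distance, rounded half up, at least 1 (rounding is assumed)."""
    return max(1, int(0.5 * avg_distance + 0.5))


# Toy example: the path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
radius = heuristic_radius(average_distance(path))
```

For the Enron network, with 〈d〉 = 4.0252 in Table 1, `heuristic_radius` returns 2, the value quoted in the caption of Fig. 2.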

Discussion

To measure the influences of nodes in a given networked dynamics, a straightforward method is to estimate the interacting strengths between node pairs in advance. The gravity law is a simple, elegant and representative formula that estimates the interacting strength between two nodes by simultaneously considering the intrinsic influences of the two nodes themselves and the distance between them. In this paper, the gravity model (Eq. 1) makes use of both the neighborhood information and the path information, which were adopted separately in many previous methods. Furthermore, to reduce the computational complexity and to avoid the noise accumulated along long paths, we proposed a local version of the gravity model (LGM, see Eq. 2). Both GM and LGM are very competitive and, of particular interest, LGM requires less computation yet performs even better. Indeed, LGM is one of the two best-performing methods among many well-known benchmark algorithms.

A potential disadvantage of LGM is that it has a free parameter, namely the truncation radius R. The negative effects of R are twofold. Firstly, determining the optimal value of R requires additional computation. Secondly, if the optimal value, say R*, is very large, the computational complexity of LGM will be more or less the same as that of GM. Fortunately, as shown in Fig. 2, we found an empirical relation between R* and the average distance 〈d〉, so that if computational resources are highly limited, we can use this relation (see Eq. 5) to approximate R*. In addition, since most real networks have the small-world property[31,38], R* should be small, and LGM thus requires much less computation than GM. Moreover, the difference between the two rankings of nodes produced by neighboring values of R quickly converges to a very small value, so that choosing a small value of R will probably perform very well. In Table 3, we show the values of τ(R), the Kendall’s Tau between the two rankings of nodes’ influences obtained with truncation radii R and R + 1. One can observe that at R = 5, all networks have τ(R) > 0.97 and half of them have τ(R) > 0.99. This indicates a strong saturation: increasing R further produces almost the same ranking once R is already large.
Table 3

The Kendall’s Tau between two rankings of nodes’ influences produced by the LGM with truncation radius R and R + 1.

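| Networks | R = 1 | R = 2 | R = 3 | R = 4 | R = 5 |
|---|---|---|---|---|---|
| Jazz | 0.9748 | 0.9927 | 0.9976 | 0.9981 | 0.9993 |
| NS | 0.9348 | 0.9629 | 0.9752 | 0.9797 | 0.9829 |
| GrQc | 0.9197 | 0.9161 | 0.9380 | 0.9628 | 0.9721 |
| EEC | 0.9773 | 0.9882 | 0.9963 | 0.9978 | 0.9988 |
| Email | 0.9596 | 0.9770 | 0.9840 | 0.9927 | 0.9963 |
| PG | 0.9413 | 0.9596 | 0.9766 | 0.9886 | 0.9957 |
| Enron | 0.8479 | 0.8958 | 0.9274 | 0.9611 | 0.9793 |
| PB | 0.9682 | 0.9865 | 0.9956 | 0.9977 | 0.9984 |
| Facebook | 0.8797 | 0.9431 | 0.9768 | 0.9842 | 0.9899 |
| WV | 0.9668 | 0.9760 | 0.9958 | 0.9982 | 0.9989 |
| Sex | 0.9039 | 0.9042 | 0.9500 | 0.9615 | 0.9712 |
| USAir | 0.9607 | 0.9697 | 0.9858 | 0.9912 | 0.9939 |
| Power | 0.9486 | 0.9672 | 0.9717 | 0.9754 | 0.9785 |
| Router | 0.8416 | 0.9007 | 0.9402 | 0.9600 | 0.9720 |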

Another similar model (named G+, see Eq. 11) shows very close performance to LGM. In comparison, LGM is more efficient since it depends only on the local topological structure and thus can be calculated not only faster but also when the global topology is unknown. In the absence of global topology, G+ cannot be obtained since it sets a node’s k-shell index as its mass, and determining the k-shell index requires knowledge of the whole network. Despite the difference between G+ and LGM, the very good performance of both strongly suggests the validity and advantage of using the gravity law to estimate the interacting strength.

Of course, both G+ and LGM are very simple and general, and they can be further improved in the following aspects (which we leave as open issues for future studies). Firstly, introducing a few tunable parameters that adjust the relative importance of mass and distance (e.g., replacing the fixed exponents of $d^2$ and of $k$ with tunable ones) may result in more accurate predictions, as indicated by known variants of the gravity law in other applications[39]. Secondly, we should explore how the topological features and dynamical processes affect the prediction accuracy, and thus improve the original methods by introducing topology-dependent and/or dynamics-sensitive terms[40,41]. Thirdly, the original gravity law is symmetric, whereas, owing to the different roles of different nodes or the essentially asymmetric nature of the dynamics[42,43], the influence of node i on node j could differ from the influence of node j on node i, in which case an asymmetric form of the gravity law may be relevant.

Methods

The Kendall’s Tau

The Kendall’s Tau[37] is an index measuring the correlation strength between two sequences. Consider two sequences with N elements, $X = (x_1, x_2, \ldots, x_N)$ and $Y = (y_1, y_2, \ldots, y_N)$. Any pair of two-tuples $(x_i, y_i)$ and $(x_j, y_j)$ (i ≠ j) is concordant if both $x_i > x_j$ and $y_i > y_j$ or both $x_i < x_j$ and $y_i < y_j$. The pair is discordant if $x_i > x_j$ and $y_i < y_j$ or $x_i < x_j$ and $y_i > y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor discordant. The Kendall’s Tau of two sequences X and Y can be calculated as

$$\tau = \frac{2(n_+ - n_-)}{N(N-1)}, \tag{6}$$

where $n_+$ and $n_-$ denote the number of concordant and discordant pairs, respectively. The extent to which τ exceeds zero indicates the strength of the correlation.
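The definition above translates directly into code. The following is a minimal sketch with a hypothetical function name; it counts concordant and discordant pairs by brute force, so it takes O(N²) time and is meant only to make the definition concrete, not to be used on the larger networks.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's Tau as defined above: 2 * (n_plus - n_minus) / (N * (N - 1))."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y: neither concordant nor discordant
    return 2 * (concordant - discordant) / (n * (n - 1))


# Identical orderings give 1, reversed orderings give -1.
same = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])
reversed_order = kendall_tau([1, 2, 3, 4], [40, 30, 20, 10])
```

Here `x` would hold the simulated influences and `y` the scores assigned by an algorithm, listed in the same node order.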

Benchmark centralities

Degree Centrality[9] of node i is defined as

$$DC(i) = k_i = \sum_{j=1}^{N} a_{ij}, \tag{7}$$

where $A = \{a_{ij}\}$ is the adjacency matrix, that is, $a_{ij} = 1$ if i and j are connected and 0 otherwise.

H-index[10] of node i, denoted by H(i), is defined as the maximal integer such that node i has at least H(i) neighbors whose degrees are all no less than H(i). This index extends the famous H-index in scientific evaluation[44] to network analysis.

Closeness Centrality[14] of node i is defined as

$$CC(i) = \frac{N - 1}{\sum_{j \neq i} d_{ij}}. \tag{8}$$

Betweenness Centrality[15] of node i is defined as

$$BC(i) = \sum_{s \neq i,\; s \neq t,\; i \neq t} \frac{g_{st}(i)}{g_{st}}, \tag{9}$$

where $g_{st}$ is the number of shortest paths between nodes s and t, and $g_{st}(i)$ is the number of shortest paths between nodes s and t that pass through node i.

Gravity Centrality[16] (G) of node i is defined as

$$G(i) = \sum_{j \in \psi_i} \frac{k_s(i)\, k_s(j)}{d_{ij}^2}, \tag{10}$$

where $k_s(i)$ is the k-shell index of node i, and $\psi_i$ is the set of nodes whose distance to node i is less than or equal to 3.

Extended Gravity Centrality[16] (G+) of node i is defined as

$$G_+(i) = \sum_{j \in \Lambda_i} G(j), \tag{11}$$

where $\Lambda_i$ is the set of neighbors of node i.
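Three of the benchmark indices above (H-index, G and G+) can be sketched compactly in plain Python. The function names and the adjacency-dictionary input are assumptions of this sketch, and the k-shell indices are obtained by the usual iterative pruning of low-degree nodes; this is an illustration of the definitions rather than the implementation used for the reported results.

```python
from collections import deque


def h_index(adj, i):
    """Largest h such that node i has at least h neighbors of degree >= h."""
    degs = sorted((len(adj[j]) for j in adj[i]), reverse=True)
    h = 0
    for rank, d in enumerate(degs, start=1):
        if d < rank:
            break
        h = rank
    return h


def k_shell(adj):
    """k-shell index of every node, by repeatedly pruning nodes of degree <= k."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        k = max(k, min(degree[u] for u in remaining))
        stack = [u for u in remaining if degree[u] <= k]
        while stack:
            u = stack.pop()
            if u not in remaining:
                continue  # already pruned via another stack entry
            remaining.discard(u)
            shell[u] = k
            for v in adj[u]:
                if v in remaining:
                    degree[v] -= 1
                    if degree[v] <= k:
                        stack.append(v)
    return shell


def gravity_centrality(adj, radius=3):
    """G(i): k-shell indices as masses, summed over nodes within `radius` hops."""
    ks = k_shell(adj)
    scores = {}
    for i in adj:
        dist = {i: 0}
        queue = deque([i])
        while queue:
            u = queue.popleft()
            if dist[u] == radius:
                continue  # stop expanding at the distance cutoff
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        scores[i] = sum(ks[i] * ks[j] / d ** 2 for j, d in dist.items() if j != i)
    return scores


def extended_gravity_centrality(adj, radius=3):
    """G+(i): sum of G over the neighbors of node i."""
    g = gravity_centrality(adj, radius)
    return {i: sum(g[j] for j in adj[i]) for i in adj}


# Toy example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
g_plus = extended_gravity_centrality(toy)
```

In the toy graph the triangle forms the 2-shell and the pendant node the 1-shell. Because the pruning needs the whole graph, G and G+ cannot be computed from a node's local neighborhood alone.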