Şirag Erkol, Claudio Castellano, Filippo Radicchi.
Abstract
Influence maximization is the problem of finding the set of nodes of a network that maximizes the size of the outbreak of a spreading process occurring on the network. Solutions to this problem are important for strategic decisions in marketing and political campaigns. The typical setting consists in the identification of small sets of initial spreaders in very large networks. This setting makes the optimization problem computationally infeasible for standard greedy optimization algorithms that account simultaneously for information about network topology and spreading dynamics, leaving room only for heuristic methods based on the drastic approximation of relying on the geometry of the network alone. The literature on the subject is replete with purely topological methods for the identification of influential spreaders in networks. However, it is unclear how far these methods are from being optimal. Here, we perform a systematic test of the performance of a multitude of heuristic methods for the identification of influential spreaders. We quantify the performance of the various methods on a corpus of 100 real-world networks; the corpus consists of networks small enough for the application of greedy optimization, so that results from this algorithm serve as the baseline for analyzing the performance of the other methods on the same corpus. We find that relatively simple network metrics, such as adaptive degree or closeness centralities, achieve performances very close to the baseline value, thus providing good support for the use of these metrics in large-scale problem settings. Also, we show that a further 2-5% improvement towards the baseline performance is achievable by hybrid algorithms that combine two or more topological metrics. This final result is validated on a small collection of large graphs where greedy optimization is not applicable.
Year: 2019 PMID: 31641200 PMCID: PMC6805897 DOI: 10.1038/s41598-019-51209-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Identification of influential spreaders in large networks.
| Network | Nodes | Edges | Critical p | Ref. | url | Ratio, subcrit. | Ratio, critical | Ratio, supercrit. |
|---|---|---|---|---|---|---|---|---|
| Slashdot | 51,083 | 116,573 | 0.0262 | [ | url | 1.003 | 1.017 | 1.062 |
| Gnutella, Aug. 31, 2002 | 62,561 | 147,878 | 0.0956 | [ | url | 1.009 | 1.040 | 1.039 |
| Epinions | 75,877 | 405,739 | 0.0062 | [ | url | 1.012 | 1.057 | 1.130 |
| Flickr | 105,722 | 2,316,668 | 0.0142 | [ | url | 1.007 | 1.082 | 1.242 |
| Gowalla | 196,591 | 950,327 | 0.0073 | [ | url | 1.011 | 1.024 | 1.066 |
| EU email | 224,832 | 339,925 | 0.0119 | [ | url | 1.002 | 1.009 | 0.923 |
| Web Stanford | 255,265 | 1,941,926 | 0.0598 | [ | url | 1.009 | 1.031 | 1.035 |
| Amazon, Mar. 2, 2003 | 262,111 | 899,792 | 0.0940 | [ | url | 1.008 | 1.025 | 0.994 |
| YouTube friend. net. | 1,134,890 | 2,987,624 | 0.0063 | [ | url | 1.004 | 1.013 | 0.952 |
| Average, large networks | | | | | | 1.007 ± 0.001 | 1.033 ± 0.007 | 1.050 ± 0.030 |
| Average, corpus of 100 networks | | | | | | 1.001 ± 0.002 | 1.021 ± 0.003 | 1.043 ± 0.005 |
We compare the performance of the hybrid method with that of the individual method AD. For the hybrid method, we use the values of the coefficients reported in Table 2. From left to right, we report the name of the network, the number of nodes in the giant component, the number of edges in the giant component, the critical value of the spreading probability p, references to studies where the network was first analyzed, the url from which the network data were downloaded, and the ratio between the performance metric of the hybrid method and that of the individual method AD for the subcritical, critical, and supercritical regimes. The bottom two rows of the table report, for each dynamical regime, the average value and the standard error of the mean of the ratio over the set of large networks and over the corpus of 100 networks considered in the rest of the paper.
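As a rough check on the summary row for the large networks, the averages can be recomputed from the nine per-network ratios listed in the table. The sketch below is only illustrative: it assumes the standard error of the mean is built from the sample standard deviation (the caption does not say which estimator was used), and because the table entries are rounded to three decimals, the last digit may differ from the published values. The names `ratios`, `mean_and_sem`, and `summary` are introduced here for illustration.

```python
import math
import statistics

# Hybrid/AD ratios for the nine large networks, copied from the table.
ratios = {
    "subcritical": [1.003, 1.009, 1.012, 1.007, 1.011, 1.002, 1.009, 1.008, 1.004],
    "critical": [1.017, 1.040, 1.057, 1.082, 1.024, 1.009, 1.031, 1.025, 1.013],
    "supercritical": [1.062, 1.039, 1.130, 1.242, 1.066, 0.923, 1.035, 0.994, 0.952],
}


def mean_and_sem(values):
    """Return (mean, standard error of the mean) of a list of numbers.

    Assumption: the SEM uses the sample standard deviation (n - 1 in the
    denominator); the source does not state which estimator was used.
    """
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


summary = {regime: mean_and_sem(vals) for regime, vals in ratios.items()}
for regime, (mean, sem) in summary.items():
    print(f"{regime:>13}: {mean:.3f} +/- {sem:.3f}")
```

The recomputed means should agree with the table to within rounding; small differences in the error estimates are expected given the rounded inputs and the assumed estimator.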
Methods for the selection of influential spreaders.
| Group | Method | Abbrev. | Ref. | Complexity |
|---|---|---|---|---|
| Baseline | Greedy | G | [ | cubic |
| Baseline | Random | R | none | constant |
| Local | Degree | D | none | linear |
| Local | Adaptive Degree | AD | [ | linear |
| Global | Betweenness | B | [ | quadratic |
| Global | Closeness | C | [ | quadratic |
| Global | Eigenvector | E | [ | linear |
| Global | Katz | K | [ | linear |
| Global | PageRank | PR | [ | linear |
| Global | Non-backtracking | NB | [ | linear |
| Global | Adaptive NB | ANB | [ | quadratic |
| Intermediate | k-shell | KS | [ | linear |
| Intermediate | LocalRank | LR | [ | linear |
| Intermediate | h-index | H | [ | linear |
| Intermediate | CoreHD | CD | [ | linear |
| Intermediate | Collective Influence | CI1 | [ | linear |
| Intermediate | Collective Influence | CI2 | [ | linear |
| Intermediate | Expl. Immunization | EI | [ | linear |
We list basic details of all the methods for the detection of influential spreaders in complex networks that we consider in this study. Each row of the table refers to a specific method. From left to right, we report the full name of the method, the abbreviation of the method name, the reference of the paper where the method was introduced, and the computational complexity of the method. Computational complexities reported in the table are obtained under the realistic assumption that methods are applied to sparse networks where the number of edges scales linearly with the network size. Methods are further grouped into different categories, i.e., baseline, local, global, and intermediate, depending on their properties.
Figure 1. Relative size of the outbreak as a function of the relative size of the seed set for the email communication network of ref. [43]. To obtain relative values, we divide outbreak size and seed set size by the total number of nodes in the network. Relative measures allow for an immediate comparison across networks of different sizes. We compare the performance of different methods for the selection of influential nodes. Outbreak size is calculated for ICM dynamics at the critical threshold. To avoid overcrowding, we display results only for a subset of the methods considered in the paper.
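The relative outbreak size described in this caption can be estimated numerically. The following is a minimal sketch, assuming that ICM denotes the independent cascade model in its standard form (each newly infected node gets one chance to infect each susceptible neighbor, independently with probability p). The function names, the toy graph, and the number of runs are illustrative choices, not taken from the paper.

```python
import random


def icm_outbreak(adj, seeds, p, rng):
    """Run one independent-cascade simulation.

    adj maps each node to a list of neighbors. Every newly infected node
    tries once to infect each still-susceptible neighbor with probability p.
    Returns the relative outbreak size: infected nodes / total nodes.
    """
    infected = set(seeds)
    frontier = list(infected)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in adj[node]:
                if neighbor not in infected and rng.random() < p:
                    infected.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return len(infected) / len(adj)


def mean_outbreak(adj, seeds, p, runs=1000, seed=0):
    """Average the relative outbreak size over independent runs."""
    rng = random.Random(seed)
    return sum(icm_outbreak(adj, seeds, p, rng) for _ in range(runs)) / runs


# Hypothetical toy graph: a triangle (0, 1, 2) with a tail 2-3-4.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
print(mean_outbreak(toy, seeds=[2], p=0.3))
```

Dividing by the number of nodes is what makes curves from networks of different sizes directly comparable, as the caption notes.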
Figure 2. Cumulative distribution of the relative performance obtained by using a method for the identification of influential spreaders other than the greedy algorithm. The metric of relative performance is defined in Eq. (3). The distribution is obtained considering all networks in our dataset. For every network, the outbreak size is calculated for ICM dynamics at the critical threshold of the spreading probability p. See details in the SM1. To avoid overcrowding, we display results only for the same subset of the methods as already considered in Fig. 1.
Figure 3. Cumulative distribution of the precision metric defined in Eq. (4). The distribution is obtained considering all networks in our dataset. Results for the greedy algorithm used in the comparison are those obtained for ICM dynamics at the critical threshold of the spreading probability p. See details in the SM1. To avoid overcrowding, we display results only for the same subset of the methods as already considered in Fig. 1.
Figure 4. Performance and precision of methods for the identification of influential spreaders in real networks. Results are based on the systematic analysis of 100 real-world networks. For each network, we first evaluate the critical value of the spreading probability p for ICM dynamics. Then, we consider three distinct phases of spreading, shown in panels (a), (b), and (c). Each point in the various panels corresponds to one method. Every method is used to identify the top T N spreaders in the networks. For clarity of the figure, methods are identified by the abbreviations defined in Table 1. Methods are characterized by the two metrics defined in the paper, overall performance and overall precision, both of which relate the performance of a generic method m to that of the greedy algorithm. Overall performance relies on the size of the outbreak associated with the set of influential spreaders identified by the method, compared to the typical outbreak obtained with the greedy algorithm. Overall precision instead quantifies the overlap between the set of spreaders identified by a method and the set identified by the greedy algorithm. Error bars (not shown), which quantify the standard errors of the mean associated with the numerical estimates of the two metrics, are about the same size as the symbols used in the visualization.
Figure 5. Pairwise comparison among methods for the identification of influential spreaders. For every pair of methods m1 and m2, we evaluated the overlap between the two sets of top T N influential spreaders found by the methods in the network, using a precision metric similar to the one of Eq. (4). We then estimated the average value of the precision over the entire corpus of real networks at our disposal. In the figure, dark colors correspond to high values of precision; low precision values are represented with light colors. Acronyms of the methods are defined in Table 1. Methods are listed in the same order as they appear in Table 1.
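The pairwise overlap described in this caption can be sketched in a few lines. Eq. (4) is not reproduced in this record, so the form used here is an assumption: precision is taken as the fraction of nodes in one top-k set that also appear in the other, which is symmetric when both sets have the same size. The helper names and the scores are hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring nodes as a set (ties broken by node id)."""
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return set(ranked[:k])


def precision(set_a, set_b):
    """Fraction of set_a that also appears in set_b.

    Assumed form of the overlap metric; the exact definition of Eq. (4)
    is not available in this text.
    """
    if not set_a:
        raise ValueError("set_a must not be empty")
    return len(set_a & set_b) / len(set_a)


# Hypothetical scores assigned to five nodes by two different methods.
method_1 = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
method_2 = {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.7, "e": 0.2}

overlap = precision(top_k(method_1, 3), top_k(method_2, 3))
print(overlap)
```

Averaging this quantity over all networks in a corpus, for every pair of methods, would give a matrix of the kind the figure visualizes.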
Hybrid methods for the identification of influential spreaders in networks.
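| Method | Features | Subcrit. | Critical | Supercrit. |
|---|---|---|---|---|
| AD | c (AD) | 1.000 | 1.000 | 1.000 |
| AD | Overall performance | 0.993 | 0.961 | 0.931 |
| AD | Overall precision | 0.755 | 0.548 | 0.119 |
| CD | c (CD) | 1.000 | 1.000 | 1.000 |
| CD | Overall performance | 0.983 | 0.963 | 0.929 |
| CD | Overall precision | 0.730 | 0.525 | 0.100 |
| B | c (B) | 1.000 | 1.000 | 1.000 |
| B | Overall performance | 0.946 | 0.954 | 0.938 |
| B | Overall precision | 0.590 | 0.483 | 0.110 |
| AD,B | c (AD) | 0.718 | 0.590 | 0.023 |
| AD,B | c (B) | −0.027 | 0.046 | 0.069 |
| AD,B | Overall performance | 0.987 | 0.964 | 0.936 |
| AD,B | Overall precision | 0.755 | 0.551 | 0.116 |
| AD,PR,LR | c (AD) | 1.189 | 1.044 | 0.115 |
| AD,PR,LR | c (PR) | −0.266 | 0.145 | 0.772 |
| AD,PR,LR | c (LR) | −0.336 | −0.632 | −0.771 |
| AD,PR,LR | Overall performance | 0.991 | 0.980 | 0.971 |
| AD,PR,LR | Overall precision | 0.806 | 0.616 | 0.300 |
| PR,LR,CD | c (PR) | 0.006 | 0.386 | 0.803 |
| PR,LR,CD | c (LR) | −0.419 | −0.702 | −0.771 |
| PR,LR,CD | c (CD) | 1.028 | 0.898 | 0.088 |
| PR,LR,CD | Overall performance | 0.985 | 0.979 | 0.971 |
| PR,LR,CD | Overall precision | 0.784 | 0.597 | 0.293 |
| AD,B,LR | c (AD) | 1.096 | 1.047 | 0.343 |
| AD,B,LR | c (B) | −0.010 | 0.067 | 0.083 |
| AD,B,LR | c (LR) | −0.466 | −0.565 | −0.395 |
| AD,B,LR | Overall performance | 0.993 | 0.976 | 0.952 |
| AD,B,LR | Overall precision | 0.810 | 0.625 | 0.220 |
| PR,LR,EI | c (PR) | 0.304 | 0.583 | 0.740 |
| PR,LR,EI | c (LR) | 0.101 | −0.251 | −0.733 |
| PR,LR,EI | c (EI) | 0.235 | 0.277 | 0.121 |
| PR,LR,EI | Overall performance | 0.973 | 0.964 | 0.970 |
| PR,LR,EI | Overall precision | 0.698 | 0.589 | 0.304 |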
The table is organized in blocks, each corresponding to a specific method. For every method m, either individual or hybrid, we report values for the three dynamical regimes in terms of overall performance and overall precision. The top three blocks correspond to the best individual methods in the three regimes according to the overall performance metric. The remaining blocks are for hybrid methods. In each block, the first rows report the values of the coefficient c of each individual method m in the definition of the hybrid method. We report the averages of the coefficient values over 1,000 iterations of the learning algorithm. The bottom two rows in each block report instead the values of the performance metrics. Errors associated with all these measures are always smaller than 0.001 and are omitted from the table for clarity.
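To illustrate how coefficients of this kind could enter a hybrid ranking, the sketch below combines per-method node scores into a single score per node. It rests on assumptions that this record does not confirm: that the hybrid score is a weighted sum of the individual scores, that scores are rescaled to a common range before being combined, and that the coefficients in the AD,B block are listed in the order of the method names. It is also static, whereas adaptive methods such as AD would recompute scores after each selection. The node scores are hypothetical; only the two coefficients (subcritical column, AD,B block) come from the table.

```python
def hybrid_scores(method_scores, coefficients):
    """Combine per-method node scores into one hybrid score per node.

    method_scores: {method: {node: score}}, with all methods scoring the
    same nodes. coefficients: {method: c}. The weighted-sum form is an
    assumption made for illustration.
    """
    nodes = next(iter(method_scores.values())).keys()
    return {
        node: sum(c * method_scores[m][node] for m, c in coefficients.items())
        for node in nodes
    }


# Coefficients taken from the AD,B block, subcritical column.
coefficients = {"AD": 0.718, "B": -0.027}

# Hypothetical node scores, assumed already rescaled to [0, 1].
method_scores = {
    "AD": {"u": 1.0, "v": 0.5, "w": 0.2},
    "B": {"u": 0.3, "v": 1.0, "w": 0.0},
}

scores = hybrid_scores(method_scores, coefficients)
best = max(scores, key=scores.get)
print(best, scores)
```

With these toy numbers the small negative weight on B barely changes the AD ranking, which is in line with the near-identical performance values of the AD and AD,B blocks in the subcritical column.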