
Generalized Erdős numbers for network analysis.

Greg Morrison¹, Levi H Dudte², L Mahadevan^2,3,4,5.

Abstract

The identification of relationships in complex networks is critical in a variety of scientific contexts. This includes the identification of globally central nodes and analysing the importance of pairwise relationships between nodes. In this paper, we consider the concept of topological proximity (or 'closeness') between nodes in a weighted network using the generalized Erdős numbers (GENs). This measure satisfies a number of desirable properties for networks with nodes that share a finite resource. These include: (i) real-valuedness, (ii) non-locality and (iii) asymmetry. We show that they can be used to define a personalized measure of the importance of nodes in a network with a natural interpretation that leads to new methods to measure centrality. We show that the square of the leading eigenvector of an importance matrix defined using the GENs is strongly correlated with well-known measures such as PageRank, and define a personalized measure of centrality that is also well correlated with other existing measures. The utility of this measure of topological proximity is demonstrated by showing the asymmetries in both the dynamics of random walks and the mean infection time in epidemic spreading are better predicted by the topological definition of closeness provided by the GENs than they are by other measures.


Keywords: centrality; epidemic spreading; network science

Year: 2018 PMID: 30224995 PMCID: PMC6124095 DOI: 10.1098/rsos.172281

Source DB: PubMed Journal: R Soc Open Sci ISSN: 2054-5703 Impact factor: 2.963

Introduction

The study of complex networks has increased enormously in recent years due to their applicability to a wide range of physical [1,2], biological [3], epidemiological [4,5] and sociological [6] systems. Two basic goals in this regard are to understand and quantify the structure of the network to better characterize the relationship between the interacting members of the network (the nodes), while also characterizing the dynamical processes on the network [6] that may shed light on the processes by which they form [7]. Understanding the topological properties of the network on both a global and local level can be useful in approaching both of these goals. Global properties of interest may include simple measures of the distribution of node properties, such as the degree distribution, strength distribution or distribution of clustering coefficients [8,9]. Community structure in the network [10-12], which partitions the network into densely connected sub-networks with more links within communities than between communities, has been extensively studied and may provide more detailed information about the relationship between nodes than simple distributions. Community structure can indicate the existence of underlying similarities between nodes in the network, and may have a great impact on dynamical processes occurring on the network (such as a random walk [13-15] or epidemic spreading [4,16,17]), and can influence the material properties of granular systems [1]. While global properties of networks can be used to assess the attributes of the nodes on an aggregate level, it is also of great interest to understand the topological properties of nodes on an individual, local level. Node centrality is the classic example of a topological measure associated with an individual node, which assesses the ‘importance’ of a node in a variety of contexts. 
The most basic measure of a node's centrality is simply related to its degree, a property of the node that is based solely on the local topology of its connectivity. The centrality of individual nodes can also be measured incorporating the global topology of the network in a variety of ways, including PageRank [18], betweenness [15] or random walk [13] centralities. Each of these measures reduces the global properties of the network into an individualized local measure of importance, permitting a rank-ordering of their importance in the network [19,20]. Dynamics on networks can likewise be described in terms of pairwise interactions between nodes, with the time between an origin and a destination node (e.g. sources and sinks in a random walk or the time of infection of one node given an epidemic originating at another) depending on the network topology. In many contexts [21,22], not all members of the network will necessarily agree on the importance of the same node: nodes that have a direct connection between them will be more important to each other than distant nodes in the network. Nodes that are central to the network as a whole may have very low importance from the perspective of sub-networks. The universality of importance is further complicated by the fact that we may expect the influence between a pair of nodes to be asymmetric even if they are directly connected [22] (the importance assigned by an important node towards an unimportant one is not necessarily the same as the importance assigned in the opposite direction), which may have important consequences in real-world systems [3]. The determination of a personalized measure of node importance that incorporates the global topology in an asymmetric measure is therefore an important but non-trivial problem. In this paper, we explore the use of the generalized Erdős numbers [11,23] (GENs) as a measure of topological closeness between nodes in a network. 
Using the GENs, we identify two measures of centrality using the pairwise importance between nodes, and show that these global centralities are highly correlated with other common centrality measures. We show that the infection times of a node originating from a source that is not a nearest neighbour in an epidemic spreading model are highly correlated with the GENs, indicating their potential utility in predicting the influence of network topology on the dynamics on networks. We further show that the infection times are better predicted by the GENs than two other commonly used measures of the non-metric distance between nodes in a network: the resistance distance and mean first passage times (MFPT) in a random walk. Finally, we show that the asymmetry in the GENs is correlated with that in the MFPT between nodes in a random walk. This work illustrates that the GENs are a useful measure of the topological closeness between pairs of nodes in a complex network, and also illustrates that a meaningful definition of closeness has the potential to bridge the gap between the topology of a network and the dynamics on the network in multiple contexts.

The generalized Erdős numbers

Topological closeness in complex networks

When nodes represent objects in a physical space [2,24-27], the distance between nodes, D, is a naturally defined (metric) measure of closeness between the objects. Objects that are physically proximate (or close to one another) have small D, which is bounded below by D = 0, while objects that are not close have large D. Owing to the generality of networks (where nodes and edges abstractly represent ‘objects’ and ‘interactions’, respectively), there can be no guarantee of a naturally defined distance metric [2,28], and, in some cases, the network topology itself must define a measure of closeness ($\Delta_{ij}$) based solely on the matrix of weights between nodes i and j, $w_{ij}$ (with an undirected network where $w_{ij} = w_{ji}$ is assumed throughout this paper). The proximity or closeness between nodes, $\Delta_{ij}$, will be small for nodes that are close to one another and large for distant nodes, with a simple and common choice being $\Delta_{ij} = w_{ij}^{-1}$ (so strongly connected nodes are ‘close’, and disconnected nodes are ‘far’). Alternatively, in an unweighted network, the length of the shortest path between a pair of nodes is a natural definition [28,29] and is the basis for the classic Erdős numbers in the context of an unweighted collaboration network [30]. Improvements on this simple measure which incorporate the effect of multiple paths between nodes (see figure 1a for a schematic diagram) include the resistance distance [14,31], self-consistent similarity measures [32] and communicability [33], to name only a few. An additional approach to defining similarity between nodes is found by positing a multidimensional ‘latent space’ of node properties [34], with the assumption that nodes that are close in the latent space are likely to be connected in the network and each node's position in the space inferred from the observed connectivity. Each of these methods incorporates the global topology of the network into a symmetric measure of closeness between pairs of nodes ($\Delta_{ij} = \Delta_{ji}$).
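As an illustration of a symmetric, multi-path measure of closeness, the sketch below computes the resistance distance from the pseudoinverse of the graph Laplacian, using the standard identity $R_{ij} = L^{+}_{ii} + L^{+}_{jj} - 2L^{+}_{ij}$. The function name `resistance_distance` and the small example graphs are illustrative assumptions, not part of the paper.

```python
import numpy as np

def resistance_distance(w):
    """Resistance distance matrix for a symmetric weight matrix w (connected graph)."""
    w = np.asarray(w, dtype=float)
    laplacian = np.diag(w.sum(axis=1)) - w      # L = diag(W) - w
    l_pinv = np.linalg.pinv(laplacian)          # Moore-Penrose pseudoinverse of L
    d = np.diag(l_pinv)
    return d[:, None] + d[None, :] - 2.0 * l_pinv

# Path graph 0-1-2: a single route between each pair of nodes.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]], dtype=float)
r_path = resistance_distance(path)

# Adding the edge 0-2 creates a second, parallel route between every pair.
triangle = path.copy()
triangle[0, 2] = triangle[2, 0] = 1.0
r_triangle = resistance_distance(triangle)

print(r_path[0, 1], r_triangle[0, 1])
```

In this example the measure is symmetric by construction, and the extra edge lowers the resistance distance between nodes 0 and 1 even though the direct edge between them is unchanged.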

Figure 1.

Two competing requirements for global ‘closeness’ in a network with shared resources. In (a), many short paths between nodes increase the closeness between them. This is similar to the resistance distance between nodes: additional parallel paths between them reduce their resistance distance. In (b), the finite resources of the high-degree blue node suggest that it should be less close to the red node than for the lower-degree blue node above, as resources are shared also with the other neighbours. This is similar to the transition probability from the blue node in a random walk: the more connections the blue node has, the lower probability of visiting the red node.

Finite resources and asymmetric measures of proximity

Finite resources are shared in some networks, with examples including collaboration networks (where time with one collaborator reduces the available time for others), multi-core processor components [35] (where finite memory or other hardware must be shared) and random walks (where the walker can only move to a single neighbour at a time with a transition probability $P_{ij} = w_{ij}/W_i$, with $W_i = \sum_j w_{ij}$ the total strength of node i). In the context of these networks of limited resources, closeness measures such as resistance distance may be undesirable [22], because the addition of a new edge in the network should be detrimental to some nodes (those who receive less of the finite resource due to the new edge) and beneficial to others (those who receive more due to the edge). For closeness measures based on the direct weight between nodes (where the ‘closeness’ between i and j is often taken to be $w_{ij}^{-1}$) or resistance distance between nodes, it is straightforward to see that the newly measured closeness between nodes i and j satisfies $\Delta'_{ij} \le \Delta_{ij}$ for all pairs, i.e. the addition of an edge can never cause nodes to become less close to one another. This is not sensible in the context of nodes that share a finite resource with their neighbours, as shown in figure 1b: if a node i has many neighbours, each receives less of the resource than if i had few neighbours. The expected influence of shared resources in figure 1 is captured by a number of existing measures of proximity. A quantity such as the transition probability in a random walk, $P_{ij}$, is asymmetric and ensures that nodes are closer if they have few neighbours, as pictured in figure 1b (so a walker is more likely to pass between them than if they had many connections). However, it is not a global measure of closeness because the transition probability incorporates only the nearest neighbour connections between nodes (so there is no proximity between disconnected nodes, even if multiple paths exist between them).
The PageRank matrix [18] B = γP + (1 − γ)/N with γ a teleportation parameter gives a modified estimate of proximity, a uniform measure of closeness for disconnected nodes independent of the network's geometry. The more refined non-backtracking matrix [36-38], as the name suggests, captures the transition probability between pairs of nodes with the walker forbidden to retrace the previous step in the reverse direction. The non-backtracking matrix has previously been used to identify a measure of centrality that does not suffer from localization for highly connected nodes [36]. A simple measure of node proximity can be established using the non-backtracking matrix, the probability of a non-backtracking walker moving between pairs of nodes in two steps. Note that in every random-walk-based case, these measures of proximity satisfy the expectations in figure 1b (many unshared neighbours reduce Δ) but not figure 1a (many shared neighbours increases Δ): a walker on blue moves to red in two steps with 50% (100%) probability using the random walk transition matrix (non-backtracking transition matrix) regardless of the number of shared neighbours. It is useful to develop a measure of closeness that incorporates these two (sometimes seemingly contradictory) aspects depicted in figure 1: nodes are close to one another if there are many paths between them, but popular nodes are less close to their neighbours than unpopular nodes.
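The random-walk quantities discussed above can be made concrete with a short sketch. It builds the transition matrix $P_{ij} = w_{ij}/W_i$ and the PageRank matrix $\gamma P + (1-\gamma)/N$ for a small star graph. The row-stochastic convention (entry `[i, j]` is the probability of stepping from i to j), the function names and the example graph are assumptions made for illustration, and the code assumes no isolated nodes.

```python
import numpy as np

def transition_matrix(w):
    """Row-stochastic random-walk matrix, P[i, j] = w[i, j] / W_i."""
    w = np.asarray(w, dtype=float)
    strength = w.sum(axis=1)            # W_i, assumed non-zero for every node
    return w / strength[:, None]

def pagerank_matrix(w, gamma=0.85):
    """PageRank matrix: gamma * P plus uniform teleportation (1 - gamma) / N."""
    p = transition_matrix(w)
    n = p.shape[0]
    return gamma * p + (1.0 - gamma) / n

# Star graph: hub 0 connected to leaves 1, 2 and 3.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]], dtype=float)
P = transition_matrix(star)
B = pagerank_matrix(star, gamma=0.85)

print(P[0, 1], P[1, 0])   # hub -> leaf versus leaf -> hub
print(P[1, 2], B[1, 2])   # leaf -> leaf: zero without teleportation
```

The hub divides its attention among three leaves while each leaf has only the hub to move to, which is the asymmetry of figure 1b. Two leaves have zero proximity under P despite sharing a neighbour, and under the PageRank matrix they receive only the uniform teleportation term.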

The GENs: measuring closeness via a weighted harmonic mean

We have recently shown [23] that the $E_{ij}$, or GENs, describing the topological closeness from node j to node i, satisfy the expected properties for the sharing of finite resources described in figure 1. The GENs on a weighted network of N nodes and M non-zero edges are defined as

$$E_{ij} = \frac{W_j}{\sum_{l} \dfrac{w_{jl}}{E_{il} + w_{jl}^{-1}}}, \qquad E_{ii} = 0, \qquad (2.1)$$

where $W_j = \sum_l w_{jl}$ is the total strength of node j and $w_{jl} = 0$ if nodes j and l do not share an edge. This form is chosen such that the node i is as close as possible to itself and that if j is connected to only one node k, j's closeness to i satisfies $E_{ij} \equiv E_{ik} + w_{jk}^{-1}$. If there are multiple paths between nodes, the closeness from j to i is strengthened if there is a direct connection between them but also includes a contribution from all other neighbours of j weighted by their connection strength. By choosing a harmonic mean for the form of the contribution, we bias our measure of closeness towards neighbours that themselves are close to i. There is no possibility of zero-valued $E_{ij}$ for $i \neq j$ due to the offset $w_{jl}^{-1}$, avoiding the possibility of a numerical instability [39] due to a vanishing denominator. $E_{ij}$ is thus always smaller for directly connected than indirectly connected nodes, as the contribution from direct connections in equation (2.1) is $w_{ji}^{2}$, strictly greater than $w_{jl}/(E_{il} + w_{jl}^{-1})$ for indirect connections. The GENs are defined using the global topology of the network, and $E_{ij}$ is finite even for nodes i and j in the same component that share no neighbours (as may not be the case for more local measures of closeness [22]). In appendix A, we demonstrate a number of features of the GENs when applied to synthetic networks. For homogeneous networks such as the Erdős–Rényi (ER), whose degree distribution is sharply peaked about the mean, the topological closeness between connected nodes is likewise peaked about the mean which is proportional to the mean degree of the nodes 〈k〉, while the closeness between disconnected nodes is dominated by the network size N.
Networks with heterogeneous topologies, such as the Barabási–Albert networks that have a degree distribution of $P(k) \sim k^{-3}$, likewise have a scale-free distribution of the GENs for connected nodes, indicating that the GENs are indeed able to distinguish between distinct network topologies. The nonlinear form of equation (2.1) makes analytical work intractable in all but the simplest cases, and we must generally resort to numerical work to determine the topological closeness between nodes in a network. $E_{ij}$ can be computed numerically in an iterative fashion [23], with $E_{ij} \equiv E_{ij}(\infty)$ and the recursive definition $E_{ij}(t+1) = W_j \big/ \sum_l w_{jl}/\big(E_{il}(t) + w_{jl}^{-1}\big)$, with $W_j = \sum_l w_{jl}$ (and with the constraint that $E_{ii}(t) = 0$ continually enforced). In this paper, the iteration is halted when a convergence criterion is satisfied. The method also requires an initial guess, $E_{ij}(0)$, with $E_{ij}(0) = 1$ used in this paper. Each iteration of equation (2.1) to determine the closeness of all nodes towards a particular node i requires O(M) evaluations (one for each neighbour of each node j). As there are N target nodes, a complete evaluation of the GENs requires O(NM) computations, at worst $O(N^3)$ for dense networks. This scaling is problematic for large dense networks, but the worst-case scaling of $N^3$ is common for many existing measures of centrality [15]. We note that other pairwise measures of proximity (such as resistance distance or MFPT) will generally require a matrix inversion, at a typical cost of $O(N^3)$ and thus comparable to the cost of evaluating the GENs. We also note that the evaluation of the set $\{E_{1j}\}$ is independent of the evaluation of $\{E_{2j}\}$, meaning the calculation of the GENs can be parallelized to provide a significant boost in the speed of evaluation. In addition to other existing measures of proximity that satisfy the expectations of figure 1, there is a great deal of functional freedom in writing equation (2.1).
For example, any measure of the form $E_{ij} = W_j \big/ \sum_l w_{jl}\, g\big(E_{il} + w_{jl}^{-1}\big)$ will satisfy the desired behaviour depicted in figure 1 for a monotonically decreasing g(x), with $g(x) = x^{-1}$ in the definition of equation (2.1). Another alternative definition replaces the direct weight between adjacent nodes, $w_{jl}^{-1}$, with the closeness, $E_{jl}$, in the denominator of equation (2.1): $E_{ij} = W_j \big/ \sum_l w_{jl}/\big(E_{il} + E_{jl}\big)$ (with the constraint $E_{ii} = 0$ and $E_{ij} > 0$ imposed). While these alternative definitions may be of interest in certain contexts, we continue to use equation (2.1) throughout this paper, due to its simplicity and previously demonstrated successes in prediction algorithms [23] and community detection methods [11]. Variations in the definition of $E_{ij}$ will certainly change the numerical values of the closeness, but the qualitative behaviour of the closeness between nodes is expected to be robust to perturbations of the definition of the GENs.
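The iterative evaluation described above can be sketched as follows. The sketch assumes that equation (2.1) has the weighted-harmonic-mean form $E_{ij} = W_j / \sum_l w_{jl}/(E_{il} + w_{jl}^{-1})$ with $E_{ii} = 0$, which is consistent with the single-neighbour identity and the $w^2$ direct-connection term quoted in the text. The function name, the tolerance `tol` and the stopping rule (largest change between successive iterates) are illustrative assumptions, since the paper's own criterion is not reproduced here. The dense three-index array keeps the code short but is only suitable for small networks.

```python
import numpy as np

def generalized_erdos_numbers(w, tol=1e-10, max_iter=10_000):
    """Iterate the assumed form of equation (2.1); returns E with E[i, j] = closeness from j to i."""
    w = np.asarray(w, dtype=float)
    n = w.shape[0]
    strength = w.sum(axis=1)                       # W_j, assumed non-zero
    inv_w = np.full_like(w, np.inf)                # 1 / w_jl, infinite for non-edges
    np.divide(1.0, w, out=inv_w, where=w > 0)
    e = np.ones((n, n))                            # initial guess E(0) = 1
    np.fill_diagonal(e, 0.0)                       # constraint E_ii = 0
    for _ in range(max_iter):
        # terms[i, j, l] = w_jl / (E_il + 1 / w_jl); non-edges contribute exactly 0
        terms = w[None, :, :] / (e[:, None, :] + inv_w[None, :, :])
        e_new = strength[None, :] / terms.sum(axis=2)
        np.fill_diagonal(e_new, 0.0)               # re-impose E_ii = 0 every step
        if np.max(np.abs(e_new - e)) < tol:        # assumed stopping rule
            return e_new
        e = e_new
    return e

# Unweighted path graph 0-1-2.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]], dtype=float)
E = generalized_erdos_numbers(path)
print(E[0, 1], E[1, 0])   # closeness from 1 to 0 versus from 0 to 1
```

On this path graph the end node 0 has a single neighbour, so its closeness to node 1 is exactly 1, whereas node 1 splits its weight between two neighbours and is further from node 0. The same run reproduces the single-neighbour identity, since the entry for node 2 towards node 0 equals the entry for node 1 towards node 0 plus one.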

Centrality and topological closeness

Erdős centrality and mean importance

The GENs incorporate a simple idea of what is meant by the ‘closeness’ between nodes in a network where limited resources are shared, and we expect that a node j that is topologically close to node i (having small $E_{ij}$) considers node i to be ‘important’ in some sense. We may therefore regard the inverse of the closeness between nodes ($\psi_{ij} = E_{ij}^{-1}$) as an unnormalized personalized measure of importance, allowing a ranking of all nodes in the network from the perspective of the node j. Because $\psi_{ij}$ measures the importance of i from a particular node j (rather than the network at large), it is not equivalent to a centrality measure. Having defined a pairwise measure of the importance a node j assigns to i using $\psi_{ij}$, we naturally expect that we can leverage this definition into a global measure of the importance of node i. There already exists a wide variety of methods for measuring centrality from a global perspective, including the degree [15,40,41], PageRank [18,41], random walk [13], betweenness [13,15] and non-backtracking [36] centralities. Each measure tends to rank high-degree nodes above low-degree nodes in complex networks, but takes the global network topology into account in a different way. The importance of global topology is perhaps most clear in betweenness centrality, where high-degree nodes often have high centrality, but nodes of low degree that act as bridges between components of the network may also have high centrality. To convert our personalized importance measures into a single global measure for an unweighted network, we define $\Psi_i = \sum_j a_{ij}\psi_{ij}$ (with $a_{ij}$ the adjacency matrix) as the sum of the importance the neighbours of i assign to it (akin to the approach of [32]), which we refer to as an Erdős centrality. In figure 2a, we compare $\Psi$ to a variety of other measures of centrality for a single realization of a Barabási–Albert network [7] (generated using the algorithm described in appendix B) with N = 512 and 〈k〉 = 4.
In all cases, there is correlation between these various measures but with differences between the numerical values of the centrality measures for both central and non-central nodes alike. The clear correlation seen here is consistent with other realizations of the BA network, other values of 〈k〉, and is also seen in ER networks (not shown).

Figure 2.

Centrality for a Barabási–Albert network with 〈k〉 = 20. (a) The Erdős centrality (x-axis) compared to the five common centrality measures (y-axis) shows an obvious positive correlation overall. Circles show degree centrality, squares PageRank, diamonds betweenness centrality, up-triangles random walk centrality and down-triangles non-backtracking centrality. (b,c) Betweenness centrality and PageRank compared to Erdős centrality on logarithmic axes, showing the clustering due to degree in one case (b, betweenness) but not the other (c, PageRank). (d) The intersection metric λ(n) is used to quantify the similarity between the top n elements of the Erdős centrality ($o_E(n)$) and the top n elements of the other centrality measures for varying n.

Figure 2b,c shows the same data plotted logarithmically for PageRank (b) and the non-backtracking (c) centralities in comparison with $\Psi$ for one realization of the network. The degree of each node can contribute significantly to its centrality depending on the measure, and the clustering of the data in figure 2b is driven by nodes with identical degree with different nearby network topologies that lead to differing values for the GENs. Non-backtracking centrality is less dependent on node degree (as evidenced by the lack of clustering), indicating that other topological features of the network are important using this measure. The clustering of some measures of centrality tends to occur for predominantly low-degree (and thus low-centrality) nodes, and it is preferable [20,42] to focus our comparison of the different measures on high-degree nodes [19,20]. We compare the Erdős centrality ordering to the other measures of centrality using the fractional intersection between the top-n orderings [43], $\lambda(n) = |o_X(n) \cap o_{X'}(n)|/n$, with $o_X(n)$ the top-n ordering using method X. In figure 2d, λ(n) is plotted for $X = \Psi$ and $X'$ the other centrality measures, averaged over 100 realizations of the network.
We see that comparison of other measures of centrality to the Erdős centrality exhibits a high degree of overlap at n = 1 with a sharp jump in λ for n≲10 in all measures. Beyond n ≳10, there is a slow variation, but all top-n lists remain similar above 80–90% with the exception of the non-backtracking centrality. Despite their different formulations, the top-n list for Ψ compares best to the list from random walk centrality (dashed turquoise line) above 90% for low- and high-degree nodes, indicating Ψ is most closely related to the random walk centrality over all node degrees.
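The two quantities used in this comparison can be sketched in a few lines. The sketch assumes the Erdős centrality is $\Psi_i = \sum_j a_{ij}/E_{ij}$ (the sum over the neighbours of i of the importance they assign to it) and that λ(n) is the fraction of nodes shared by two top-n lists. The function names and the toy inputs are illustrative assumptions, and the matrix `e` is taken as given rather than computed here.

```python
import numpy as np

def erdos_centrality(e, adjacency):
    """Psi_i = sum_j a_ij / E_ij, with E[i, j] the closeness from j to i."""
    e = np.asarray(e, dtype=float)
    a = np.asarray(adjacency, dtype=float)
    psi = np.zeros_like(e)
    np.divide(1.0, e, out=psi, where=e > 0)    # psi_ij = 1 / E_ij; diagonal left at 0
    return (a * psi).sum(axis=1)

def top_n_overlap(scores_x, scores_y, n):
    """Fraction of nodes appearing in both top-n lists (ties broken by index)."""
    top_x = set(np.argsort(-np.asarray(scores_x, dtype=float), kind="stable")[:n].tolist())
    top_y = set(np.argsort(-np.asarray(scores_y, dtype=float), kind="stable")[:n].tolist())
    return len(top_x & top_y) / n

# Toy closeness matrix for two connected nodes (values chosen by hand).
e_toy = np.array([[0.0, 2.0],
                  [4.0, 0.0]])
a_toy = np.array([[0, 1],
                  [1, 0]])
psi_toy = erdos_centrality(e_toy, a_toy)

# Two hypothetical centrality score vectors over five nodes.
scores_a = [5, 4, 3, 2, 1]
scores_b = [5, 3, 4, 1, 2]
overlaps = [top_n_overlap(scores_a, scores_b, n) for n in (1, 2, 3)]
print(psi_toy, overlaps)
```

Because λ(n) compares sets rather than ranks, two lists that order the same top nodes differently still overlap fully, which is why the metric is reported as a function of n.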

Importance eigenvector centrality and teleportation in random walks

The Erdős centrality, described in the previous section, is a natural definition arising from the pairwise importance $\psi_{ij}$ assigned to node i by all of its direct neighbours. While well correlated with other centrality measures (suggesting its utility), a significant amount of information regarding the global importance is neglected: the importance assigned to i by nodes that are not directly connected to it is ignored. This is true of many centrality measures, generally counting the number of direct paths between nodes to identify an overall measure of importance (degree, random walk and betweenness all proceed solely through direct links between nodes). PageRank centrality differs from a purely random-walk-based measure by accounting for indirect links between nodes through the steady state probability of a Markov process with transition probability $B_{ij} = \gamma a_{ij}/k_j + (1 - \gamma)/N$. In this process, the random walker moves between connected nodes (randomly) with probability γ, but jumps between disconnected nodes (again, randomly) with probability (1 − γ). Finding the leading eigenvector of the matrix B reduces to solving the coupled equations $\mathrm{Pr}_i = \gamma \sum_{j \in C_i} \mathrm{Pr}_j/k_j + (1 - \gamma)/N$, with $C_i$ the set of nodes connected to i (in a directed network, this is the set of nodes with edges directed towards i). In the limit of γ = 0, $\mathrm{Pr}_i = N^{-1}$ is uniform as is expected for pure teleportation. In the limit of γ = 1 (no teleportation), the PageRank equation reduces to $\mathrm{Pr}_i = \sum_{j \in C_i} \mathrm{Pr}_j/k_j$, and it is straightforward to see that the ansatz $\mathrm{Pr}_i = k_i/N$ is a solution up to normalization (as the equation becomes $k_i/N = \sum_{j \in C_i} 1/N$). A uniform probability of teleporting between distant nodes may be an imperfect model for the dynamics of a random walker on a network and a number of modifications to the PageRank algorithm have been proposed that account for inhomogeneous teleportation probabilities between nodes [44,45] in a variety of contexts.
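The two limits of the PageRank equations can be checked numerically with a simple power iteration. The sketch below assumes an undirected, unweighted, connected and non-bipartite graph (so that the pure random walk at γ = 1 converges under power iteration). The function name, tolerance and example graph are illustrative assumptions.

```python
import numpy as np

def pagerank(adjacency, gamma=0.85, tol=1e-12, max_iter=100_000):
    """Power iteration for Pr_i = gamma * sum_j a_ij Pr_j / k_j + (1 - gamma) / N."""
    a = np.asarray(adjacency, dtype=float)
    n = a.shape[0]
    degree = a.sum(axis=0)                  # k_j, assumed non-zero
    pr = np.full(n, 1.0 / n)                # start from the uniform distribution
    for _ in range(max_iter):
        new = gamma * (a @ (pr / degree)) + (1.0 - gamma) / n
        if np.abs(new - pr).sum() < tol:
            return new
        pr = new
    return pr

# Triangle 0-1-2 with a pendant node 3 attached to node 2 (not bipartite).
a = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
degree = a.sum(axis=0)

pr_walk = pagerank(a, gamma=1.0)        # no teleportation
pr_teleport = pagerank(a, gamma=0.0)    # pure teleportation
print(pr_walk, degree / degree.sum())
print(pr_teleport)
```

With γ = 1 the normalized stationary vector is proportional to the degree, and with γ = 0 it is uniform, in line with the two limits stated in the text.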
A similar Markov process strongly related to the PageRank algorithm can be defined using personalized importance: a random walk performed with a transition probability (with the convention B′ = 0, meaning the walker never remains at i). This process has an interpretation similar to that of PageRank: the most probable transition for a walker at node j to make passes through direct connections (moving to i with w > 0), but has a non-zero probability of jumping to a disconnected node. Unlike the PageRank methodology, a walker in this process has a non-uniform probability of choosing to move along an edge versus teleportation. As an example of the heterogeneity of the teleportation in this process, a node i with degree k = 1 in an unweighted network will have a most probable transition to its sole neighbour (with the greatest importance j assigns going to i with ψ = 1). However, the total probability of teleporting (moving from i to a node without a direct connection) is . In appendix A, we show that the average closeness felt between disconnected nodes in a large network scales as E ∼ N1/2, which suggests that . This indicates that walkers at low-degree nodes will usually teleport to more important nodes in the network (as for large N). Teleportation between distant nodes in the network will be highly heterogeneous in this walk, and we expect it to have a significant contribution to the centrality for large networks with low-degree nodes. The leading eigenvector of the matrix B′ can be compared to that of the PageRank transition probability matrix B, which has a uniform probability of teleporting to any node in the network (regardless of the network topology). In figure 3a, we show the steady-state probability of being found at a node i for this random walker in this process, computed from the leading eigenvector of B′ with elements g, termed importance eigenvector centrality in this paper. 
A clear correlation with the degree centrality is observed, with the solid line indicating a scaling of $g \propto k^{\alpha}$ for α ≈ 0.55. A similar quality of fit is found for larger N (discussed further below) as well as for the ER networks (not shown). Excellent agreement is found for high-degree nodes (as was the case in §3.1 for the Erdős centrality), with deviations occurring primarily for low-degree nodes that are clustered based on the node's degree. For all nodes of a fixed degree k, PageRank will tend to give a higher centrality to those nodes that are connected to high-degree hubs. By contrast, importance eigenvector centrality g will tend to give such nodes a lower centrality, as the hub's attention is divided among many nodes and it assigns a lower importance to each of its neighbours. This effect produces the downward slope in the clusters of data in figure 3a, and is more pronounced for low-degree nodes.

Figure 3.

Importance eigenvector centrality g extracted from the transition matrix defined by pairwise importance B′. (a) Shown are 10 realizations of BA networks with N = 512 nodes: 〈k〉 = 20 (red) and 〈k〉 = 4 (blue). An approximate scaling of $g \propto k^{\alpha}$ is observed, with the best fit of α = 0.55 for the different ensembles. The behaviour of ER networks is similar, but with greater clustering of the observed PageRank values (not shown). (b) Comparison of the importance eigenvector centrality g with PageRank at γ = 1 (filled circles, pure random walk) and γ = 0.85 (empty circles, 15% teleportation probability) for the largest connected component of the political blogs network [46]. The dashed line shows a scaling of $g^2 \approx \mathrm{Pr}$. The difference between the two values of PageRank's teleportation parameter primarily affects the ordering of low-degree nodes, which become more homogeneous for increasing γ.

The relationship between PageRank and the importance eigenvector centrality g persists even for real-world networks with neither a homogeneous nor scale-free degree distribution, such as the lognormally distributed 2004 political blogs network [46]. In this network, each node is a liberal or conservative blog in the lead-up to the 2004 presidential election and each edge indicates a link between the blogs. In order to implement the GENs in equation (2.1) on this network, we converted the network from a directed network (where $w_{ij} \neq w_{ji}$) to an undirected network (where $w_{ij} = \max(w_{ij}, w_{ji})$) and retained only the largest connected component of 1222 nodes. In figure 3b, we see $g^2$ and Pr are both highly correlated with the degree centrality ($R^2$ = 0.999 and 0.982, respectively), indicating that both measures are dominated by node degree rather than other details of the network topology (as was the case in the BA networks in figure 3a).
In the case of PageRank, this is due to the fact that hubs are connected to low-degree nodes, so walkers on low-degree nodes tend to move towards high-degree nodes if they do not teleport (occurring 85% of the time). In the case of importance eigenvector centrality, the model is entirely different: with more than 90% probability walkers on low-degree nodes (k ≲ 10) will teleport, but preferentially teleport to high-degree nodes. Despite the different dynamics in the walks, the steady-state probability of arriving at any node is nearly identical in both cases.

Understanding dynamics on networks through topological closeness

SIR model on an ER network

The spreading of an epidemic has been studied by many authors and in a wide range of contexts [16,17,47-49], with the susceptible-infected-recovered (SIR) model being one of the simplest and most commonly used models. The SIR model assumes that a population of susceptible individuals becomes infected due to interactions with previously infected individuals, and infected individuals may recover and become non-infectious. A simple schematic of the SIR model is shown in figure 4a, with infections occurring at a constant rate, rI, due to direct interactions between individuals, and the recovery at constant rate, rR. A number of more complex models have been considered extensively for a homogeneously mixed population of individuals [49], but non-uniform interactions between individuals, represented by networks, can have a profound impact on the dynamics of epidemic spreading in the SIR model [4,16,17]. The existence of epidemic thresholds [4,50] for homogeneous networks (or the lack thereof for scale-free networks [16]) are well-studied global quantities of interest [51], while more local quantities such as the probability of a particular node i becoming infected, sparking an epidemic [52], and quarantine or immunization strategies [48,53] have also been examined.

Figure 4.

The harmonic mean of the infection time of node j with a single initially infected node i, h, in an ER network. The SIR model is diagrammed in (a). (b–e) compare h with the GENs E (b,d) for N = 512 and 1024, respectively, on log–log axes, and the MFPT τ (c,e) for N = 512 and 1024, restricted to nodes with k > 4 in all cases. The x-axes are scaled by the mean to permit comparison, and this does not affect the scaling. Different colours denote different values of i, and the dashed lines denote the best power-law fit of h to x for x = E or τ, respectively. The variations in τ in c and e relative to the best fit are significantly larger than for E in b and d. (f) The standard deviation σ of the residuals of the various fits (with a lower value indicating a stronger relationship between x and h) as a function of the recovery rate for N = 512 and 〈k〉 = 4. (f) shows the deviation for high-degree nodes k > 4 (the behaviour is shown in appendix C for all k), with lower values of σ indicating better agreement between the observed infection times and the best fits based on the measure of closeness x.

While it is clearly useful to understand the global properties of the epidemic (such as the expected number of infected individuals), a particular individual j may also be interested in its own probability of becoming infected given the origin of the disease and may reasonably be less concerned if no neighbours are infected than if many neighbours are infected. However, it is not straightforward to analytically calculate how long the disease will take to reach j from any point in the network, and it would be useful to have a measure for how ‘close’ the epidemic is to an individual node.
If the infection begins with a single node i, we expect that the disease will more rapidly propagate to nodes for which i is topologically close, and it is therefore worthwhile to compare the pairwise infection times (infection time of node j given an initial infection at i) with measures of topological closeness, such as the resistance distance R, the MFPT in a random walk τ, and the GENs E. PageRank and betweenness are single-node properties (not properties of a pair) and cannot be used for comparison. The resistance distance and MFPT in a random walk can be computed directly from the graph Laplacian L [14,15]. To see the relationship between infection time and topological closeness, we simulate an SIR epidemic (diagrammed in figure 4a) using Gillespie dynamics [54] on an ER graph with N = 512 nodes, a uniform probability of connection and mean degree 〈k〉 = 4 or 〈k〉 = 20. The infection rate is fixed at rI = 1 and the recovery rate rR is varied, always remaining above the epidemic threshold [4,16], rI > rR/〈k〉. Even above the epidemic threshold, the disease may stochastically die off, and we take the pairwise infection time to be the harmonic mean of the infection time of a node j given an initial infection at i over all of the simulations, with K simulations initiated at site i for each rR. To compute the infection time h between all nodes, K = 100 simulations were run for every node i being the sole infected node at t = 0.
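The simulation protocol described above can be sketched in a few lines of Python (standard library only). This is an illustrative sketch, not the authors' code: the function names are hypothetical, and treating runs in which j is never infected as contributing zero to the reciprocal sum is an assumption about how the harmonic mean absorbs stochastic die-off.

```python
import math
import random

def er_graph(n, mean_degree, rng):
    """ER graph: each pair is linked with probability mean_degree / (n - 1)."""
    p = mean_degree / (n - 1)
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def sir_infection_times(adj, source, r_inf, r_rec, rng):
    """One Gillespie run; returns each node's infection time (inf if never infected)."""
    times = [math.inf] * len(adj)
    times[source] = 0.0
    infected = {source}
    t = 0.0
    while infected:
        # Every susceptible-infected edge transmits at rate r_inf;
        # a node is still susceptible if it has no infection time yet.
        si_edges = [(u, v) for u in infected for v in adj[u] if math.isinf(times[v])]
        total = r_inf * len(si_edges) + r_rec * len(infected)
        if total == 0:
            break  # nothing can happen any more
        t += rng.expovariate(total)
        if rng.random() * total < r_inf * len(si_edges):
            _, v = rng.choice(si_edges)  # infection event
            infected.add(v)
            times[v] = t
        else:
            infected.remove(rng.choice(sorted(infected)))  # recovery event
    return times

def harmonic_mean_times(adj, source, r_inf, r_rec, runs, rng):
    """Harmonic mean over `runs` simulations of the infection time of every node."""
    inv_sum = [0.0] * len(adj)
    for _ in range(runs):
        times = sir_infection_times(adj, source, r_inf, r_rec, rng)
        for j, t in enumerate(times):
            if j != source:
                inv_sum[j] += 1.0 / t  # assumption: a never-infected run adds 1/inf = 0
    h = [runs / s if s > 0 else math.inf for s in inv_sum]
    h[source] = 0.0
    return h
```

Calling `harmonic_mean_times(er_graph(512, 4, rng), i, 1.0, r_rec, 100, rng)` for each source i, with `rng = random.Random(seed)`, would mirror the setup in the text (K = 100, rI = 1), at a cost that grows quickly with N because the list of susceptible-infected edges is rebuilt at every event.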

Comparing topological closeness with infection time

The infection time can be compared to a variety of measures of topological closeness, and in this section we focus on the GENs (E), the MFPT in a random walk (τ) and the resistance distance (R). Infection that originates at a high-degree node (i) will rapidly spread throughout the network, but infections starting at a low-degree node will tend to spread only locally until a high-degree node is encountered. We thus expect the rate of infection of a non-nearest neighbour (j) of the initial infection site i to be positively correlated with its topological closeness using all three measures. In figure 4b–e, we compare h in a network with N = 512 and 〈k〉 = 4 to E (b, d) and τ (c, e), each normalized by its mean: E because the GENs do not contain any dynamic information, so that their numerical values are arbitrary, and τ for comparison with the GENs. The figures show a random sample of 20 target nodes j with k > 4 (for which there is a consistent relationship for 〈k〉 = 4, discussed further in appendix C). As expected, infection times of non-nearest neighbours are lowest for nodes that are topologically close (low E or τ), with the lines showing an empirical power-law fit of h to x for x = E or τ. The exponent is non-universal, depending on N, 〈k〉 and the recovery rate. It is apparent that the fit using the GENs is more robust than that using the MFPT, due to the clustering of τ (akin to the degree-driven clustering in figure 2b), with larger variation in h for a given value of τ than is seen for E. This is driven by the fact that τ is much more strongly correlated with the degree of the target node j than is h (shown in appendix C). The comparison of h with R has a trend similar to τ, and is not shown in the figure. The quality of the fit between the infection time h and each of the measures of closeness x is shown in figure 4f using the standard deviation of the residuals for the best power-law fit of h to x.
The mean of the residuals generally satisfies |m| ≲ 10⁻³ for all measures at all rR. Figure 4f shows that all closeness measures perform worse when rR increases, due to the fact that node recovery is independent of the network topology. The figure also clearly demonstrates that the GENs are a significantly better predictor of the infection time than either the MFPT or resistance distance for spreading on an ER network, indicating that they correspond to a relevant measure of topological closeness that has an impact on the spreading process. For an ER network with 〈k〉 = 20, all nodes have degree k > 4 with high probability, and in this case the results are consistent with those pictured in figure 4b–f without restriction on the degree. For 〈k〉 = 20, we find that σ increases overall for each measure of proximity (all on the order of σ ≈ 0.3–0.4 for rR/rI ≈ 0), as shown in appendix C. Consistent with the behaviour in figure 4, σ_E is lower than σ_τ and σ_R for non-zero rR/rI, indicating that the GENs remain a better predictor overall than resistance distance or MFPT.
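As an illustration of the fit-quality measure discussed above, the following Python sketch fits a power law by ordinary least squares in log–log space and reports the mean and standard deviation of the residuals. The function name `power_law_fit` is hypothetical, and computing the residuals in log space is an assumption; the text does not state whether the residuals were measured on logarithmic or linear axes.

```python
import math

def power_law_fit(x, h):
    """Least-squares fit of log h = log c + a * log x.

    Returns (c, a, sigma, mean_resid), where sigma and mean_resid are the
    standard deviation and mean of the log-space residuals (an assumption).
    """
    lx = [math.log(v) for v in x]
    lh = [math.log(v) for v in h]
    n = len(lx)
    mx = sum(lx) / n
    mh = sum(lh) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    a = sum((u - mx) * (w - mh) for u, w in zip(lx, lh)) / sxx  # fitted exponent
    log_c = mh - a * mx                                         # fitted log-prefactor
    resid = [w - (log_c + a * u) for u, w in zip(lx, lh)]
    mean_resid = sum(resid) / n
    sigma = math.sqrt(sum((r - mean_resid) ** 2 for r in resid) / n)
    return math.exp(log_c), a, sigma, mean_resid
```

Here `x` would hold the closeness values (E, τ or R) for the sampled node pairs and `h` the corresponding harmonic-mean infection times; a smaller `sigma` indicates that the closeness measure is a tighter predictor of h.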

Random walks and the GENs

A surprising feature of figure 4 is the significant difference between the accuracy of E and τ in predicting the infection time. Based on the good agreement between the importance centrality Ψ and random walk centrality c in figure 2d, one might have expected to find consistency between the GENs and the MFPT in a random walk. Random walk centrality is defined based on the differences in MFPT [13], with the difference between the MFPTs in the two directions between a pair of nodes equal to the difference between the nodes' values of c, rather than on the particular values of τ themselves. The MFPTs are asymmetric (the MFPT from i to j exceeds that from j to i if i is more easily reached than j), as it is easier to reach a high-degree node than a low-degree node, with a similar behaviour for the GENs (which differ between the two directions if i is topologically closer to j than j is to i). This suggests a comparison of the asymmetry between the two measures that could explain their agreement in figure 2d. In figure 5, we compare the asymmetry in the GENs, ΔE (the difference between the two GENs of a pair of nodes), to the corresponding difference in the MFPT between the nodes, Δτ, for an ER network with various N and 〈k〉. The asymmetry in the MFPT is highly correlated with the asymmetry in the GENs, with an empirical scaling of Δτ with ΔE characterized by a fitted parameter α ≈ 4 (determined using Mathematica's FindFit function). The fact that Δτ ∝ ΔE (even when there are no direct connections between i and j) again indicates that the GENs are able to capture the importance of the global network topology even for distant nodes.
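The asymmetry of the MFPT can be checked directly on small graphs. The sketch below is a hedged illustration in pure Python: it solves the first-passage linear equations exactly with rational arithmetic rather than using the graph Laplacian route of [14,15], the names `solve` and `mfpt_matrix` are hypothetical, and the index convention (`tau[i][j]` is the mean time for a walk started at i to first reach j) is an assumption that may differ from the paper's. The graph must be connected.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]

def mfpt_matrix(adj):
    """tau[i][j]: mean steps for an unbiased walk from i to first reach j."""
    n = len(adj)
    tau = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        others = [i for i in range(n) if i != j]
        # For every non-target node u: t_u = 1 + sum_v P(u -> v) t_v, with t_j = 0,
        # where P(u -> v) = 1 / degree(u) for each neighbour v of u.
        a = [[Fraction(int(u == v)) - Fraction(int(v in adj[u]), len(adj[u]))
              for v in others] for u in others]
        t = solve(a, [Fraction(1)] * len(others))
        for i, ti in zip(others, t):
            tau[i][j] = ti
    return tau

# Star graph: node 0 is the hub, nodes 1-3 are leaves.
star = [{1, 2, 3}, {0}, {0}, {0}]
tau = mfpt_matrix(star)
delta_tau = tau[0][1] - tau[1][0]  # hub -> leaf minus leaf -> hub
```

On the star graph the walk reaches the hub from a leaf in a single step, whereas reaching a particular leaf from the hub takes longer on average, so `delta_tau` is positive: the high-degree node is the easier one to reach.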

Figure 5.

Asymmetry in the GENs of Erdős–Rényi networks, ΔE (the difference between the two GENs of a node pair), compared with the asymmetry in the MFPTs for those networks, Δτ (the corresponding difference between the two MFPTs). The colours indicate the probability p of seeing that value of the (ΔE, Δτ) pair (counts normalized by the total number of pairs in the simulated networks). In these density plots, darker colours correspond to a greater observed frequency of the same (ΔE, Δτ) pair. Shown are networks with N = 512, 1024 and 10 000 nodes and with 〈k〉 = 4 and 20, as indicated in the figure. The asymmetry of the GENs is highly correlated with that in the MFPT (with the slope of the best fit line indicated).

Conclusion

In this paper, we have shown the utility of the GENs in measuring a non-metric topological closeness between nodes in complex networks lacking a well-defined distance metric. Derived from simple principles based on a conceptual picture of nodes sharing finite resources, the GENs incorporate the global topology of the network into a pairwise measure of closeness for connected and disconnected nodes alike. Other non-local pairwise measures can be found in the literature (e.g. the MFPT in a random walk or resistance distance between nodes), and we have shown that the GENs are able to describe the structure of and dynamics on networks in a manner consistent with or outperforming these existing measures. The utility of the GENs was first demonstrated by identifying two potential measures of centrality derived from the GENs that identify important nodes in heterogeneous networks consistent with existing methods. The Erdős centrality (with ψ = E⁻¹) defines centrality in terms of the importance assigned by nearest neighbours and is appropriate for unweighted networks. An alternative measure of centrality that takes the importance assigned between all node pairs i and j into account arose from a novel definition of a random walk with teleportation: the importance eigenvector centrality was defined as the steady-state probability of being found in a node i in a walk with transition probabilities p ∝ E⁻¹. This is conceptually related to the teleportation probability in PageRank, but with our eigenvector centrality having an inhomogeneous teleportation probability depending on the importance of each node. In both cases, we showed that these centrality measures are consistent with existing approaches despite the very different origins they all have. The GENs were further shown to be useful in quantifying the impact of the network topology on the dynamics of epidemic spreading on an ER network.
Nodes that are disconnected but topologically close in a network should more quickly spread the infection between each other than nodes that are distant. While the resistance distance and MFPT in a random walk are both positively correlated with infection time (as expected), the GENs are an overall better predictor for high-degree nodes. We note that the dynamics of the SIR model were not chosen to match the dynamics of the epidemic spreading, as the SIR model does not have a finite resource shared between nodes (as each node can infect all of its neighbours with equal rate). The GENs are expected to perform well on predicting the infection risk of nodes for other disease models in which the process of infecting one node may reduce the infection rate of other neighbours. Taken together, the quality of the centrality measures and the correlation with dynamical processes on networks suggest that the GENs are a meaningful measure of topological proximity and may be of potential benefit in a variety of contexts.

30 in total

1. Emergence of scaling in random networks

2. Modular organization of cellular networks.

3. A note on a paper by Erik Volz: SIR dynamics in random networks.

4. Asymmetric coevolutionary networks facilitate biodiversity maintenance.

5. Network epidemic models with two levels of mixing.

6. Multiscale mobility networks and the spatial spreading of infectious diseases.

7. Quarantine-generated phase transition in epidemic spreading.

8. Influence of network topology on sound propagation in granular materials.

9. Dynamics of ranking processes in complex systems.

10. Predicting the epidemic threshold of the susceptible-infected-recovered model.