
Early detection of dynamic harmful cascades in large-scale networks.

Chuan Zhou^1,2, Wei-Xue Lu³, Jingzun Zhang⁴, Lei Li⁵, Yue Hu^1,2, Li Guo^1,2.

Abstract

Quickly detecting harmful cascades in networks allows us to analyze their causes and prevent the further spread of destructive influence. Since it is often impossible to observe the state of all nodes in a network, a common method is to detect harmful cascades from sparsely placed sensors. However, harmful cascades are usually dynamic (e.g., the cascade initiators and diffusion trajectories can change over time), which can severely undermine the robustness of the selected sensors. Meanwhile, the large scale of current networks greatly increases the time complexity of sensor selection. Motivated by these observations, in this paper we investigate the scalable sensor selection problem for early detection of dynamic harmful cascades in networks. Specifically, we first put forward a dynamic susceptible-infected model to describe harmful cascades, and formally define a detection time minimization (DTM) problem which focuses on effective sensor placement for early detection of dynamic cascades. We prove that it is #P-hard to calculate the objective function exactly and propose two Monte-Carlo methods to estimate it efficiently. We prove the NP-hardness of the DTM problem and design a corresponding greedy algorithm. Based on that, we propose an efficient upper bound based greedy (UBG) algorithm with the theoretical performance guarantee preserved. To further accommodate different types of large-scale networks, we propose two accelerations of UBG that improve the time complexity: Quickest-Path-UBG for sparse networks and Local-Reduction-UBG for dense networks. The experimental results on synthetic and real-world social networks demonstrate the practicality of our approaches.


Keywords: Diffusion networks; Early detection; Sensor placement

Year: 2017 PMID: 32288925 PMCID: PMC7102699 DOI: 10.1016/j.jocs.2017.10.014

Source DB: PubMed Journal: J Comput Sci

Introduction

Harmful cascades spreading through various kinds of network structures have become increasingly ubiquitous in the modern world. A contagious disease like Severe Acute Respiratory Syndrome (SARS) can spread quickly through a population contact network and lead to an epidemic [12], [34]. A computer virus on a few servers can spread rapidly to other servers or computers in a computer connection network [15], [17]. In a similar vein, a rumor started by a few individuals can spread quickly through an online social network [2], [26]. It is crucial to detect harmful cascades as soon as they happen or shortly thereafter, since early detection allows us to study the causes and prevent further spreading of harmful influence [32], [38]. In practice, it is commonly infeasible or unaffordable to monitor all nodes at all times, and therefore a common way to detect cascades is to select important nodes where we can place sensors for monitoring [7], [19], [31], [39]. However, existing methods usually view the cascade data as static and deterministic, ignoring the important fact that harmful cascades are usually dynamic and time-variant, in the sense that the cascade initiators and diffusion trajectories can change randomly over time. These dynamics bring new challenges to network monitoring, since they can severely undermine the robustness of the selected sensors [22]. Meanwhile, existing sensor selection algorithms work only for small networks and do not scale well to today's large networks [28]. Motivated by these observations, in this paper we investigate scalable solutions to the sensor selection problem for early detection of dynamic harmful cascades in networks. To model the dynamic property, we carry out our work from a model-driven perspective. To provide a unified framework, we model all the above examples as an infection spreading in a network G = (V, E), where V is the set of nodes (i.e. individuals) and E is the set of edges (i.e. relationships).
In a population network, the infection is the disease that is transmitted between individuals. In the example of a computer virus spreading in a network, the infection is the computer virus, while for the case of a rumor spreading in a social network, the infection is the rumor. Under this unified network framework, we propose a new dynamic cascade model to describe harmful cascade diffusions, and define a detection time minimization (DTM) problem $S^* = \arg\min_{|S| = k,\, S \subseteq V} D(S)$, where k is a given parameter determined by budget or monitoring capacity, S is the set of sensor nodes, and D(S) is the detection time. The DTM problem focuses on effective selection of sensor nodes for early detection of dynamic harmful cascades. We first prove the NP-hardness of the DTM problem and design a corresponding greedy algorithm. Considering the limitation caused by the inefficiency of the greedy algorithm, we then propose two alternative Monte-Carlo methods, each having its pros and cons, to estimate the #P-hard objective function D(S) efficiently. To further address the scalability issue, we propose an efficient upper bound based greedy (UBG) algorithm and its two accelerations, Quickest-Path-UBG and Local-Reduction-UBG, to cater for different types of large-scale networks. These two accelerations are close to UBG in performance but orders of magnitude faster than UBG in running time. Experiments on synthetic and real-world social networks demonstrate the practicality of our approaches.

Related work

Monitoring a whole network through a finite set of sensor nodes has been widely applied to detect contamination in water distribution networks [19] and virus outbreaks in human society [7]. Some early work places sensors by topological measures, e.g. targeting high degree nodes [32] or highly connected nodes [8]. However, the results of this approach are commonly unsatisfactory, since it ignores the multi-step complexity of network diffusions. The “Battle of Water Sensor Network” challenge [31] motivated a number of works that optimize sensor placement in water networks to detect contamination [16], [18]. By exploiting the submodular property, they proposed to optimize the sensor placement under different criteria, such as maximizing the probability of detection, minimizing the detection time, or minimizing the size of the subnetwork affected by the phenomenon [24]. In these works the data are a set of deterministic scenarios, and the random nature of diffusion is not taken into consideration. Moreover, their methods do not scale to large networks [28]. In addition to the above two types of methods, another related work [1] sets sensors along fixed paths in the network so as to gather sufficient information to locate possible contamination. Early detection of contagious outbreaks by monitoring the neighborhood (friends) of a randomly chosen node (individual) was studied by Christakis and Fowler [7]. Krause et al. [20] presented efficient schedules for minimizing energy consumption in battery-operated sensors, while other works analyzed distributed solutions with limited communication capacities and costs [13], [21], [22]. In addition, Berry et al. [4] equated the placement problem with a p-median problem. Li et al. [25] proposed a dynamic-programming (DP) based greedy algorithm with a near-optimal performance guarantee.
By contrast, our work investigates the scalable sensor selection problem for early detection of dynamic harmful cascades in networks from a model-driven perspective. The main difference from previous works is that we take the dynamic properties of cascades into consideration: (1) the cascade initiator is dynamic, since we do not know who will initiate the next harmful diffusion, and (2) the diffusion process is dynamic, since the propagation trajectory is uncertain. Under these two sources of dynamics, we aim to select k nodes as sensors, accurately and quickly, for early detection of harmful cascades. Besides, as networks keep growing in size, we also need to ensure that the proposed sensor selection methods remain effective on large networks.

Our contributions

To minimize the time needed to detect dynamic harmful cascades, our research aims to optimize the selection of sensors in a scalable way. Our contributions are summarized as follows: (1) We formulate a Detection Time Minimization (DTM) problem. To describe dynamic harmful cascades on networks, we put forward a dynamic cascade model for the harmful diffusion. (2) We prove that it is #P-hard to calculate the objective function of the DTM problem exactly and propose two equivalent Monte-Carlo methods to estimate it, each with its own advantages and disadvantages. (3) We show the NP-hardness of the DTM problem, as it contains the Set Cover problem as a simple special case. We convert the DTM problem into a constrained maximization problem, prove the submodularity of the new objective function, and then employ the greedy algorithm, which achieves an approximation ratio of 1 − 1/e. (4) We theoretically establish new upper bounds for the remaining time. Based on these bounds, we further propose a new Upper Bound based Greedy (UBG for short) algorithm, which can significantly reduce the number of estimations in greedy-based algorithms, especially at the initial step. (5) We propose two accelerations of UBG, Quickest-Path-UBG and Local-Reduction-UBG, to address the DTM problem for large networks; they are close to UBG in performance but orders of magnitude faster than UBG in running time.

In addition, our methods not only aim at early detection of harmful cascades, but also shed light on a host of other applications. For example, the goal of Emerging Topic Detection is to identify emergent topics in a social network, assuming full access to the stream of all postings. Providers such as Twitter or Facebook have immediate access to all tweets or postings as they are submitted to their servers [6], [29], while outside observers need an efficient mechanism to monitor changes, such as the methods developed in this work. As another example, an emerging trend in algorithmic stock trading is the use of automatic search through the Web and social networks for pieces of information that can be used in trading decisions before they appear on the more popular news sites [23], [27]. Similarly, intelligence, business and politics analysts scan online sources for new information. The theoretical framework and methods established in this paper can also be used in these applications.

The rest of the paper is organized as follows. Section 2 presents the cascade model and problem formulation. In Section 3 we show the #P-hardness of detection time calculation and derive two equivalent estimation methods. Section 4 is devoted to the NP-hardness of the DTM problem, the submodular properties of the transformed objective function, and the corresponding greedy algorithm. To prune unnecessary estimation calls, we analyze the upper bounds for the newly proposed UBG algorithm in Section 5. Section 6 presents two accelerations of UBG for large-scale networks. Section 7 shows the experimental results. We conclude the paper in Section 8. Table 1 outlines the major variables used.

Table 1

Major variables in the paper.

Variables	Descriptions
G = (V, E)	social network G with node set V and edge set E
N	number of nodes in the network G
I_t	set of infected nodes at step t
J_t	cumulative set of nodes infected up to step t
k	number of sensors to be selected
Par(v)	set of parents of node v
τ(u, S)	the detection time as in Eq. (2)
T_max	time horizon we consider
D(S)	the expected value of detection time as in Eq. (4)
Π = {π(u), u ∈ V}	probability distribution about the uncertainty of nodes being the infected source
R(S)	the remaining time for taking actions
ℙ_u	probability measure with initial infected node u
𝔼_u	expectation operator with initial infected node u
Θ_t^u	row vector with probabilities as in Eq. (17)
IP	N by N infection probability matrix


Problem formulation

In this section we start with a description of the dynamic cascade model and then define the Detection Time Minimization (DTM) problem abstracted from the early detection problem.

Dynamic cascade model

We start with the framework of [24], where the models introduced are essentially descriptive: they specify a joint distribution over all nodes’ behavior in a global sense. In contrast, we focus on more operational models, from mathematical sociology [3] and interacting particle systems [10], that explicitly represent the step-by-step dynamics of infection. Considering operational models for the dynamic diffusion of a harmful item through a network G, represented by a directed graph, we speak of each individual node as being either infected or uninfected. Motivated by the properties of contaminants such as rumors and viruses, we focus from now on on the progressive case, in which nodes can switch from being uninfected to being infected, but do not switch in the other direction. We also focus on the setting in which each node's tendency to become infected increases monotonically as more of its neighbors become infected. Thus, the process can be viewed with respect to some particular uninfected node v: as time unfolds, more and more of v's neighbors become infected; at some point, this may cause v to become infected, and v's infection may in turn trigger further infections of nodes connected with v.

Motivated by the above observation, we put forward a dynamic susceptible-infected (DSI) model for harmful cascade spreading, which can be seen as a variant of the common susceptible-infected (SI) model [3]. The susceptible nodes are those with at least one infected neighbor, and the infected nodes do not recover. Specifically, consider a directed graph G = (V, E) with N nodes in V and edge labels ip : E → (0, 1]. For each edge (u, v) ∈ E, ip(u, v) denotes the probability that v is infected by u in each attempt through the edge. If (u, v) ∉ E, ip(u, v) = 0. Let Par(v) be the set of parent nodes of v, i.e.,

$$\mathrm{Par}(v) := \{u \in V : (u, v) \in E\}.$$

The DSI model comes with a probability distribution Π = {π(u), u ∈ V} that describes the dynamics of cascade initiators. Π can be seen as a priori knowledge of nodes being infected initially, where ∑_{u ∈ V} π(u) = 1 holds.

The DSI model first chooses an initially infected node u ∈ V according to the distribution Π, and then works as follows. Let I_t ⊆ V be the set of nodes that get infected at step t ≥ 0, with I_0 = {u}. Define

$$J_t := \bigcup_{0 \le i \le t} I_i$$

to be the cumulative set of nodes that have been infected up to step t ≥ 0. Then, at step t + 1, each node w ∈ J_t may infect each of its out-neighbors v ∈ V∖J_t with an independent probability of ip(w, v). Thus, a node v ∈ V∖J_t is infected at step t + 1 with the probability

$$1 - \prod_{w \in J_t \cap \mathrm{Par}(v)} \bigl(1 - ip(w, v)\bigr).$$

If node v is successfully infected, it is added into the set I_{t+1}. Then J_{t+1} is obtained by J_{t+1} ⟵ J_t ∪ I_{t+1}. Note that each infected node has more than one chance to infect its susceptible out-neighbors until they get infected, and each node stays infected once it is infected. Obviously the cumulative infected process (J_t)_{t ≥ 0} is Markovian.
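The step rule of the DSI model can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `simulate_dsi`, the `parents` dictionary (node to list of parents) and the `ip` dictionary keyed by edge are assumptions made for the example.

```python
import random

def simulate_dsi(parents, ip, source, t_max, rng):
    """Run one DSI cascade for t_max steps.

    parents: dict mapping every node v to the list Par(v)   (assumed layout)
    ip:      dict mapping an edge (w, v) to its infection probability
    Returns a dict {node: step at which it became infected}.
    """
    infected_at = {source: 0}              # J_0 = I_0 = {source}
    for t in range(t_max):
        newly = []                         # this will become I_{t+1}
        for v in parents:
            if v in infected_at:
                continue                   # progressive model: no recovery
            # Probability that every infected parent fails in this step.
            fail = 1.0
            for w in parents[v]:
                if w in infected_at:
                    fail *= 1.0 - ip[(w, v)]
            # v is infected with probability 1 - prod(1 - ip(w, v)).
            if rng.random() < 1.0 - fail:
                newly.append(v)
        for v in newly:                    # J_{t+1} = J_t ∪ I_{t+1}
            infected_at[v] = t + 1
    return infected_at

# A chain a -> b -> c with certain infection spreads one hop per step.
chain_parents = {"a": [], "b": ["a"], "c": ["b"]}
chain_ip = {("a", "b"): 1.0, ("b", "c"): 1.0}
print(simulate_dsi(chain_parents, chain_ip, "a", 5, random.Random(0)))
```

Newly infected nodes are collected first and merged afterwards, so all attempts within one step see the same set J_t.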

Detection time minimization problem

In a diffusion model, given an initially infected node u ∈ V and a set of sensors S ⊆ V, the random detection time is defined as

$$\tau(u, S) := \inf\{t \ge 0 : I_t \cap S \ne \emptyset\} \wedge T_{\max}, \tag{2}$$

where a ∧ b := min{a, b} and T_max is the time horizon that we observe. We assume inf{∅} = +∞. The random detection time τ(u, S) denotes the time delay until a contaminant initiated from node u is detected by one of the sensors in S.

Proposition 1. The random detection time τ(u, S) has the following useful properties: (1) if u ∈ S, then τ(u, S) = 0; (2) for any sensor sets S_1 and S_2, τ(u, S_1 ∪ S_2) = τ(u, S_1) ∧ τ(u, S_2); (3) if the graph is undirected, then τ(u, {v}) and τ(v, {u}) are identically distributed for any nodes u and v; (4) if S ⊆ T, then τ(u, S) ≥ τ(u, T) for any node u; (5) 0 ≤ τ(u, S) ≤ T_max for any node u and sensor set S.

Proof. The above properties follow from the definition in Eq. (2). We give the detailed proofs in Section 8.1. Note that the detection time τ(u, S) can be viewed as a special type of stopping time in stochastic process theory [11]. □

Let the variable X denote the random initiator with distribution Π, i.e., for every u ∈ V, we have

$$\mathbb{P}(X = u) = \pi(u). \tag{3}$$

The (expected) detection time from a random initiator to one of the selected sensors in S can be defined as

$$D(S) := \mathbb{E}[\tau(X, S)], \tag{4}$$

where 𝔼 is the expectation operator under the cascade model. By conditional expectation, we can calculate the expected detection time D(S) in the following way:

$$D(S) = \sum_{u \in V} \pi(u)\, \mathbb{E}_u[\tau(u, S)]. \tag{5}$$

Indeed, Eq. (5) can be reached as follows:

$$D(S) = \mathbb{E}\bigl[\mathbb{E}[\tau(X, S) \mid \sigma(X)]\bigr] = \sum_{u \in V} \mathbb{P}(X = u)\, \mathbb{E}[\tau(X, S) \mid X = u] = \sum_{u \in V} \pi(u)\, \mathbb{E}_u[\tau(u, S)],$$

where σ(X) denotes the σ-algebra generated by the variable X. Eq. (5) converts the global detection time into a summation of local detection times over u ∈ V. This conversion provides a basis for Section 5.

Our goal is to find k nodes as sensors in a network in order to minimize the time until an infection, starting from a random initiator in the network, is detected. Formally, we formulate the problem as the following discrete optimization problem: we want to find a subset S* ⊆ V such that |S*| = k and D(S*) = min{D(S) : |S| = k, S ⊆ V}, i.e.,

$$S^* = \arg\min_{|S| = k,\, S \subseteq V} D(S), \tag{6}$$

where k is a given parameter determined by budget or monitoring capacity. We call this the detection time minimization problem (DTM problem for short). For the sake of concreteness, we will discuss our results in terms of the DSI model in particular.

Detection time estimation

To solve the DTM problem in Eq. (6), the first issue is how to calculate the objective function D(S) given a sensor set S. This question is not as easy as its description in Eq. (4) suggests. For example, the DSI process is underspecified, since we have not prescribed the order in which newly infected nodes in a given step t will attempt to infect their neighbors. Thus, it is not initially obvious that the process is even well-defined, in the sense that it yields the same distribution over outcomes regardless of how we schedule the attempted infections. In fact, the computation of D(S) is #P-hard.

Theorem 1. Computing the detection time D(S) given a sensor set S is #P-hard.

Proof. We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9]. For the detailed proof, see Section 8.2. □

Since it is intractable to compute D(S) exactly on a typically sized graph, a natural idea is to employ Monte-Carlo methods to estimate D(S), which can be implemented in two different ways as follows:

Propagation simulation

The expected detection time D(S) is obtained by directly simulating the random process of diffusion triggered by a random node, say u, chosen according to the distribution Π defined in Eq. (3). Let I_t denote the set of nodes newly infected in the t-th iteration, with I_0 = {u}. In the (t + 1)-th iteration, a node w ∈ J_t := ∪_{0 ≤ i ≤ t} I_i attempts to infect each uninfected out-neighbor v with the probability ip(w, v). If it succeeds, v is added into I_{t+1}. The process is repeated until the diffusion hits the sensor set S at some step T(u, S), i.e.,

$$T(u, S) := \inf\{t \ge 0 : I_t \cap S \ne \emptyset\},$$

where inf{∅} := +∞ by convention. The detection time of this single simulation is recorded as T(u, S) ∧ T_max, which is exactly τ(u, S) as defined in Eq. (2). We run such simulations many times and finally estimate the expected detection time D(S) by averaging over all simulations.
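The propagation simulation can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the names `detection_time` and `estimate_D`, the adjacency layout `out_edges[w] = [(v, ip(w, v)), ...]`, and the default of 1000 runs are all chosen for illustration.

```python
import random

def detection_time(out_edges, source, sensors, t_max, rng):
    """One propagation simulation; returns T(u, S) ∧ T_max.

    out_edges: dict node -> list of (out-neighbor, infection probability)
    sensors:   a set of sensor nodes
    """
    if source in sensors:
        return 0
    infected = {source}                    # J_0
    for t in range(1, t_max):
        newly = set()                      # I_t
        for w in infected:
            for v, p in out_edges.get(w, []):
                if v not in infected and rng.random() < p:
                    newly.add(v)
        if newly & sensors:
            return t                       # the cascade hits a sensor at step t
        infected |= newly
    return t_max                           # not detected before the horizon

def estimate_D(out_edges, pi, sensors, t_max, runs=1000, seed=0):
    """Average detection time over `runs` cascades with the source drawn from pi."""
    rng = random.Random(seed)
    nodes = list(pi)
    weights = [pi[u] for u in nodes]
    total = 0
    for _ in range(runs):
        u = rng.choices(nodes, weights=weights)[0]
        total += detection_time(out_edges, u, sensors, t_max, rng)
    return total / runs

# Deterministic chain a -> b -> c, cascade always starts at a, sensor on c.
edges = {"a": [("b", 1.0)], "b": [("c", 1.0)]}
prior = {"a": 1.0, "b": 0.0, "c": 0.0}
print(estimate_D(edges, prior, {"c"}, t_max=10, runs=50))
```

Each simulation stops as soon as a sensor is hit, which is why this method only examines a small portion of the edges when a single sensor set is evaluated.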

Snapshot simulation

According to the characteristics of the DSI model, the time that an infected node u spends in infecting its uninfected neighbor v is distributed geometrically with parameter ip(u, v). We denote this time cost by c(u, v), i.e.,

$$\mathbb{P}\bigl(c(u, v) = s\bigr) = \bigl(1 - ip(u, v)\bigr)^{s-1}\, ip(u, v)$$

for any integer s ≥ 1. We can flip all coins a priori to produce a weighted graph G = (V, E, c), where each edge (u, v) is labeled with a time cost c(u, v), a sample of the above random variable. Such a snapshot provides an easy way to sample the detection time of any sensor set S, which exactly equals the smallest time cost from the initially infected node, say u (chosen according to the distribution Π), to the nearest sensor in S. Define c(u, S) to be the smallest time cost from u to the nearest sensor in S; then the detection time of this single simulation is recorded as c(u, S) ∧ T_max. We produce plenty of snapshots and finally estimate the expected detection time D(S) by averaging over all snapshots.

Theorem 2. The snapshot simulation is equivalent to the propagation simulation in estimating D(S). More specifically, the random variables T(u, S) and c(u, S) are identically distributed.

Proof. Since the two simulations record T(u, S) ∧ T_max and c(u, S) ∧ T_max respectively, it is enough to prove that the random variables T(u, S) and c(u, S) are identically distributed. For proof details, see Section 8.3. □

We have confirmed that the two methods are equivalent, while each has its own advantages and disadvantages. For estimating a specific D(S), the propagation simulation is faster, because it only needs to examine a small portion of the edges, while the snapshot simulation has to examine all the edges. For estimating the expected detection time of many different sensor sets, the snapshot simulation outperforms the propagation simulation in terms of time complexity, since each snapshot serves all sensor sets. Given these observations, in the experimental part the heuristic algorithms use the propagation simulation, while the greedy-based algorithms employ the snapshot simulation.
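A snapshot can be sketched as geometric sampling followed by a shortest-path search. The code below is an illustrative sketch only: `sample_snapshot` and `snapshot_detection_time` are hypothetical names, the adjacency layout matches no particular library, and Dijkstra's algorithm is used here as one standard way to find the smallest time cost to the nearest sensor.

```python
import heapq
import math
import random

def sample_snapshot(out_edges, rng):
    """Label every edge with a geometric time cost c(u, v) >= 1."""
    snap = {}
    for u, nbrs in out_edges.items():
        snap[u] = []
        for v, p in nbrs:
            if p >= 1.0:
                cost = 1
            else:
                # Inverse-transform sampling of a geometric variable on {1, 2, ...}.
                cost = max(1, math.ceil(math.log(1.0 - rng.random()) / math.log(1.0 - p)))
            snap[u].append((v, cost))
    return snap

def snapshot_detection_time(snap, source, sensors, t_max):
    """Return c(u, S) ∧ T_max using Dijkstra's algorithm on one snapshot."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue                       # stale heap entry
        if u in sensors:
            return d                       # nearest sensor reached before T_max
        for v, c in snap.get(u, []):
            nd = d + c
            if nd < t_max and nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return t_max

# One snapshot can be reused for several sensor sets.
edges = {"a": [("b", 1.0)], "b": [("c", 1.0)]}
snap = sample_snapshot(edges, random.Random(0))
print(snapshot_detection_time(snap, "a", {"c"}, 10),
      snapshot_detection_time(snap, "a", {"b"}, 10))
```

Reusing `snap` across sensor sets is what makes this variant attractive inside greedy loops, where many candidate sets are scored against the same samples.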

NP-hardness and simple greedy

In this section we first show the NP-hardness of the DTM problem in Eq. (6), as it contains the Set Cover problem as a simple special case; then we propose a simple greedy algorithm for the DTM problem.

Theorem 3. The detection time minimization problem in Eq. (6) under the DSI model is NP-hard.

Proof. Consider an instance of the NP-complete Set Cover problem, defined by a collection of subsets S_1, S_2, …, S_m of a ground set U = {u_1, u_2, …, u_n}; we wish to know whether there exist k of the subsets whose union is equal to U. We can assume that k < n < m here. We show that this can be viewed as a special case of the optimization problem (6). Given an arbitrary instance of the Set Cover problem, we define a corresponding directed bipartite graph with n + m nodes: there is a node i corresponding to each element u_i ∈ U, a node j corresponding to each set S_j, and a directed edge (i, j) with infection probability ip(i, j) = 1 whenever u_i ∈ S_j. Define the probability distribution Π = {π(u), u ∈ V} on the knowledge of nodes being initially infected so that only the n nodes corresponding to the ground set U can be initially infected, each with positive probability. The Set Cover problem is then equivalent to deciding whether there is a set S of k nodes in this bipartite graph with D(S) ≤ 1. Note that for the instance we have defined, infection is a deterministic process, as all probabilities are 0 or 1. Monitoring the k nodes corresponding to the sets in a Set Cover solution detects, within one step, every diffusion initiated from the n nodes corresponding to the ground set U; conversely, if any set S of k nodes has D(S) ≤ 1, then the Set Cover problem must be solvable. □

Since the optimization problem is NP-hard and the network is prohibitively large, we cannot compute the optimum value to verify the actual quality of approximations. Hence a natural idea is to employ the greedy algorithm as an approximation method. To make better use of the greedy algorithm, we consider an optimization problem equivalent to Eq. (6), as follows:

$$S^* = \arg\max_{|S| = k,\, S \subseteq V} R(S), \tag{9}$$

where

$$R(S) := T_{\max} - D(S) \tag{10}$$

is defined as the (expected) remaining time for taking actions when a contaminant is detected.
The above alternative formulation has key properties, as described in Theorem 4 below. Recall that a non-negative real-valued function f on subsets of V is submodular if

$$f(S \cup \{v\}) - f(S) \ge f(T \cup \{v\}) - f(T)$$

for all S ⊆ T ⊆ V and v ∈ V∖T. That is, f has diminishing marginal returns. Moreover, f is monotone if f(S) ≤ f(T) for all S ⊆ T.

Theorem 4. The remaining time function R is monotone and submodular with R(∅) = 0.

Proof. It is obvious that the remaining time function R is monotone and R(∅) = 0. Now we prove that R is submodular. According to the definitions in Eqs. (5) and (10), it suffices to show that

$$\tau(u, S) - \tau(u, S \cup \{v\}) \ge \tau(u, T) - \tau(u, T \cup \{v\}) \tag{13}$$

for all S ⊆ T ⊆ V and v ∈ V∖T. By the second property of τ in Proposition 1, Eq. (13) is equivalent to

$$\tau(u, S) - \tau(u, S) \wedge \tau(u, \{v\}) \ge \tau(u, T) - \tau(u, T) \wedge \tau(u, \{v\}). \tag{14}$$

Note that τ(u, T) ≤ τ(u, S) always holds by the monotone decrease property of τ in Proposition 1. Now we discuss Eq. (14) separately according to the value of τ(u, {v}). If τ(u, {v}) ≥ τ(u, S), both sides of Eq. (14) equal 0. If τ(u, T) ≤ τ(u, {v}) < τ(u, S), Eq. (14) turns into τ(u, S) − τ(u, {v}) ≥ 0, which obviously holds. If τ(u, {v}) < τ(u, T), Eq. (14) turns into τ(u, S) − τ(u, {v}) ≥ τ(u, T) − τ(u, {v}), which holds by the monotone decrease of τ. Eq. (14) is proven and we thus get that R(S) is submodular. □

By the above properties in Theorem 4, the problem given in Eq. (9) can be approximated by the greedy algorithm in Algorithm 1 with the set function f := R. For any submodular and monotone function f with f(∅) = 0, the problem of finding a set S of size k that maximizes f(S) can be approximated by the greedy algorithm in Algorithm 1. The algorithm iteratively selects a new sensor u that maximizes the incremental change of f and includes the new sensor into the set S until k sensors have been selected. It is shown that the algorithm guarantees an approximation ratio of f(S)/f(S*) ≥ 1 − 1/e, where S is the output of the greedy algorithm and S* is the optimal solution [30].

Algorithm 1. Greedy(k, R)

In Greedy(k, R), a thorny problem is that there is no efficient way to compute R(S) given a placement S, so we run 10,000 snapshot trials to obtain an accurate estimate of R(S), as described in Section 3. This leads to an expensive selection time. Another source of inefficiency in Greedy(k, R) is that the remaining time estimation step is executed O(kN) times, where k is the number of sensors to be selected and N is the number of nodes. When N is large, the efficiency of the algorithm is unsatisfactory. Hence, in order to improve the efficiency of Greedy(k, R), one can either reduce the number of calls for evaluating R(S), or develop advanced heuristic algorithms which conduct fast and approximate estimations of R(S) at the expense of accuracy guarantees.
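The greedy selection described above can be sketched generically. The code is a hedged illustration: `greedy` and the toy coverage function stand in for Greedy(k, R) and for a Monte-Carlo estimate of R(S); neither name nor the toy data comes from the paper.

```python
def greedy(nodes, k, f):
    """Pick k nodes, each time adding the node with the largest marginal gain of f.

    f must accept a set of nodes; for the DTM problem it would be an
    estimate of the remaining time R(S).
    """
    selected = set()
    current = f(selected)
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for v in nodes:                    # up to N evaluations of f per round
            if v in selected:
                continue
            gain = f(selected | {v}) - current
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:                   # fewer than k nodes available
            break
        selected.add(best)
        current += best_gain
    return selected

# Toy monotone submodular function: number of items covered by the chosen nodes.
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1}}
coverage = lambda S: len(set().union(*(cover[v] for v in S)))
print(greedy(list(cover), 2, coverage))
```

The double loop makes the O(kN) evaluation count explicit, which is the cost the following sections set out to reduce.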

Upper bound based Greedy

In order to prune the estimation calls in Greedy(k, R), a natural idea is to employ the Cost-Effective Lazy Forward selection (CELF) algorithm proposed in [24]. The principle behind it is that, by submodularity, the marginal gain of a node in the current iteration cannot exceed its marginal gain in previous iterations, and thus the number of estimation calls can be greatly reduced. CELF produces the same sensor set as the original greedy algorithm, but is much faster [24]. Although CELF significantly improves Greedy(k, R), the sensor selection time is still unaffordable on large networks. In particular, in the first round, to establish the initial upper bounds, CELF needs to estimate R({v}) using MC simulations for each node v, leading to N MC calls, which is time-consuming, especially when the network is very large. This limitation leads to a rather fundamental question: can we derive an upper bound of R({v}) that can be used to further prune unnecessary detection time estimations (MC calls) in Greedy(k, R)? Motivated by this question and the idea in [40], in this section we derive an initial upper bound of R({v}) for Greedy(k, R). Based on the bound, we propose a new greedy algorithm, Upper Bound based Greedy (UBG for short), which outperforms the original CELF algorithm. Essentially different from the bounds derived for the influence spread under the IC model [40], we here derive a new upper bound for the remaining time under the DSI model. For simplicity, we write R(v) := R({v}) and τ(u, v) := τ(u, {v}) for all u, v ∈ V hereafter.
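The lazy-evaluation principle behind CELF can be sketched with a priority queue. This is a minimal sketch of the general technique, not the implementation from [24]: the function name `celf`, the heap layout, and the toy coverage function are assumptions for illustration.

```python
import heapq

def celf(nodes, k, f):
    """Lazy greedy selection; returns (selected nodes, number of gain evaluations)."""
    selected = []
    current = f(frozenset())
    # First round: one evaluation per node to initialise the bounds (N calls).
    heap = []
    for i, v in enumerate(nodes):
        gain = f(frozenset([v])) - current
        heap.append((-gain, i, v, 0))      # (negated gain, tie-breaker, node, round)
    heapq.heapify(heap)
    calls = len(nodes)
    while heap and len(selected) < k:
        neg_gain, i, v, rnd = heapq.heappop(heap)
        if rnd == len(selected):
            # The gain is up to date for the current set and tops the queue: pick v.
            selected.append(v)
            current += -neg_gain
        else:
            # Stale gain: by submodularity it is still an upper bound, so only
            # the node at the top of the queue needs to be re-evaluated.
            gain = f(frozenset(selected) | {v}) - current
            calls += 1
            heapq.heappush(heap, (-gain, i, v, len(selected)))
    return selected, calls

cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1}}
coverage = lambda S: len(set().union(*(cover[v] for v in S)))
print(celf(list(cover), 2, coverage))
```

On this toy input the plain greedy loop would evaluate the function 4 + 3 = 7 times for two picks, while the lazy variant needs 5; the N evaluations of the first round remain, which is the cost the upper bound is meant to remove.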

Preparations

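In this part, we aim to derive an upper bound of R(v). Before introducing the upper bounds in Theorem 5, we first prepare two propositions. Let ℙ_u(v ∈ J_t) denote the probability that node v has become infected up to step t when the initially infected node is u. We have the first proposition as follows.

Proposition 2. For v ∈ V, the remaining time R(v) under the DSI model can be calculated as

$$R(v) = \sum_{u \in V} \pi(u) \sum_{t=0}^{T_{\max}-1} \mathbb{P}_u(v \in J_t).$$

Proof. In fact, by the definition in Eq. (2), we first have

$$\tau(u, v) = \sum_{t=0}^{T_{\max}-1} \mathbf{1}\{v \notin J_t\},$$

where 1{v ∉ J_t} is a binary indicator function: if v is not infected up to step t, 1{v ∉ J_t} = 1; otherwise, 1{v ∉ J_t} = 0. Then we have

$$R(v) = T_{\max} - \sum_{u \in V} \pi(u)\, \mathbb{E}_u[\tau(u, v)] = T_{\max} - \sum_{u \in V} \pi(u) \sum_{t=0}^{T_{\max}-1} \mathbb{P}_u(v \notin J_t) = T_{\max} - \sum_{u \in V} \pi(u) \sum_{t=0}^{T_{\max}-1} \bigl(1 - \mathbb{P}_u(v \in J_t)\bigr) = \sum_{u \in V} \pi(u) \sum_{t=0}^{T_{\max}-1} \mathbb{P}_u(v \in J_t),$$

where the fourth ‘=’ is due to the fact that ∑_{u ∈ V} π(u) = 1. □

Proposition 2 reveals that we can treat the global remaining time R(v) as a summation, over all T_max propagation steps, of the local probabilities ℙ_u(v ∈ J_t). Based on Proposition 2, a following question is: what is the relationship between the two sets {ℙ_u(v ∈ J_{k−1}) : v ∈ V} and {ℙ_u(v ∈ J_k) : v ∈ V}?

Proposition 3. For k ≥ 1, we have the following inequality:

$$\mathbb{P}_u(v \in J_k) \le \mathbb{P}_u(v \in J_{k-1}) + \sum_{w \in \mathrm{Par}(v)} \mathbb{P}_u(w \in J_{k-1})\, ip(w, v).$$

Proof. For k ≥ 1, by the definition of conditional expectation and the DSI model, we obtain

$$\mathbb{P}_u(v \in J_k) = \mathbb{P}_u(v \in J_{k-1}) + \mathbb{E}_u\Bigl[\mathbf{1}\{v \notin J_{k-1}\}\Bigl(1 - \prod_{w \in J_{k-1} \cap \mathrm{Par}(v)} \bigl(1 - ip(w, v)\bigr)\Bigr)\Bigr] \le \mathbb{P}_u(v \in J_{k-1}) + \mathbb{E}_u\Bigl[\mathbf{1}\{v \notin J_{k-1}\} \sum_{w \in J_{k-1} \cap \mathrm{Par}(v)} ip(w, v)\Bigr] \le \mathbb{P}_u(v \in J_{k-1}) + \sum_{w \in \mathrm{Par}(v)} \mathbb{P}_u(w \in J_{k-1})\, ip(w, v),$$

where the first ‘≤’ is due to 1 − ∏_i (1 − x_i) ≤ ∑_i x_i for x_i ∈ [0, 1], and the second ‘≤’ comes from the fact that 1{v ∉ J_{k−1}} ≤ 1. □

Proposition 3 clearly identifies the ordering relationship between two adjacent elements in the series ℙ_u(v ∈ J_0), ℙ_u(v ∈ J_1), …. Now we simplify the results in Proposition 3 by using matrix form. Let IP be the infection probability matrix with the element at position (u, v) being ip(u, v). For t ≥ 0, denote the row vector

$$\Theta_t^u := \bigl(\mathbb{P}_u(v_1 \in J_t), \ldots, \mathbb{P}_u(v_N \in J_t)\bigr) \tag{17}$$

as the probabilities of nodes being infected up to step t. Obviously, Θ_0^u is the unit row vector with 1 at position u, since J_0 = {u}. Now we can rewrite Proposition 3 in matrix form,

$$\Theta_k^u \le \Theta_{k-1}^u (E + IP).$$

By iteration, we further get that Θ_t^u ≤ Θ_0^u (E + IP)^t, where E is a unit matrix. Furthermore, since every probability is at most 1, it follows that

$$\Theta_t^u \le \Theta_0^u \bigl[(E + IP)^t \wedge 1\bigr].$$

Hereafter, define A ∧ 1 := {a(i, j) ∧ 1} for a matrix A = {a(i, j)}.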

The Upper Bound of R(v)

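With the above preparations, we can present the result on the upper bound of the remaining time as follows.

Theorem 5. The upper bound of the remaining time function for each node v is

$$R(v) \le \sum_{u \in V} \pi(u) \sum_{t=0}^{T_{\max}-1} \bigl[(E + IP)^t \wedge 1\bigr](u, v), \tag{20}$$

where E is a unit matrix and A(u, v) means the element at position (u, v) in matrix A.

Proof. With the preparations of Propositions 2 and 3, it follows that

$$R(v) = \sum_{u \in V} \pi(u) \sum_{t=0}^{T_{\max}-1} \Theta_t^u(v) \le \sum_{u \in V} \pi(u) \sum_{t=0}^{T_{\max}-1} \bigl[(E + IP)^t \wedge 1\bigr](u, v),$$

where a(v) is the element at position v in vector a. □

Define the remaining time row vector R := (R(v_1), …, R(v_N)); then the upper bound in Eq. (20) becomes

$$\mathbf{R} \le \Pi \sum_{t=0}^{T_{\max}-1} \bigl[(E + IP)^t \wedge 1\bigr], \tag{21}$$

where Π is a prior distribution on the likelihood of nodes being the infected source.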

The calculation of the upper bound

We first use a toy example to explain how to calculate the upper bound.

Example 1. Given a directed network G in Fig. 1 with the infection probability matrix in Eq. (22), let T_max = 10 and assume the prior distribution Π is uniform on the entire graph. Based on Eq. (21), the upper bound of the remaining time can be calculated for each of the four nodes, which gives the bounds 5.1687, 5.0698, 4.1376, and 7.6006, respectively. □

Fig. 1

An illustration of the upper bound calculation.

The matrix calculation used in the upper bound is expensive, because the sum of matrix powers in Eq. (21) is intractable when the network size is large. To overcome the difficulty, we adopt the following procedure to calculate it. Denote by t_{≥1} the first step t at which every element of (E + IP)^t is at least 1. If t_{≥1} ≤ T_max − 1, we do not need to calculate (E + IP)^t any more when t ≥ t_{≥1}, and therefore we have

$$\sum_{t=0}^{T_{\max}-1} (E + IP)^t \wedge 1 = \sum_{t=0}^{t_{\ge 1}-1} (E + IP)^t \wedge 1 + (T_{\max} - t_{\ge 1})\, \mathbf{1},$$

where 𝟏 is an N × N matrix with all elements being 1.

Additionally, if the infection probabilities are relatively small, the sum admits the approximation referred to as Eq. (25): when ∥IP∥ is small enough, the approximation stems from a Taylor expansion. □ By Eqs. (21) and (25), the new upper bound of the remaining time can be approximately put in the form referred to as Eq. (26), which is relatively tractable when ∥IP∥ is small.
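The calculation procedure can be sketched in plain Python. Two assumptions are made loudly here: the bound is taken to be Π ∑_{t=0}^{T_max−1} [(E + IP)^t ∧ 1], an index convention reconstructed from the surrounding definitions, and the function name `remaining_time_upper_bounds` with its list-of-lists matrix layout is hypothetical. The toy matrices below are not the matrix of Eq. (22).

```python
def remaining_time_upper_bounds(ip_matrix, pi, t_max):
    """Return the row vector pi * sum_{t=0}^{t_max-1} [(E + IP)^t ∧ 1]."""
    n = len(ip_matrix)
    # step = E + IP
    step = [[ip_matrix[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    # power holds (E + IP)^t, starting from t = 0.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    acc = [[0.0] * n for _ in range(n)]
    for t in range(t_max):
        if all(x >= 1.0 for row in power for x in row):
            # Every later power is also >= 1 elementwise, so each of the
            # remaining t_max - t terms clips to the all-ones matrix.
            for i in range(n):
                for j in range(n):
                    acc[i][j] += t_max - t
            break
        for i in range(n):
            for j in range(n):
                acc[i][j] += min(power[i][j], 1.0)
        power = [[sum(power[i][m] * step[m][j] for m in range(n)) for j in range(n)]
                 for i in range(n)]
    return [sum(pi[u] * acc[u][v] for u in range(n)) for v in range(n)]

# Two nodes, a single edge 0 -> 1 with probability 1, uniform prior, T_max = 3.
print(remaining_time_upper_bounds([[0.0, 1.0], [0.0, 0.0]], [0.5, 0.5], 3))
```

For this deterministic two-node graph the bound is tight: a sensor on node 1 detects a cascade from node 0 after one step and from node 1 immediately, so the expected detection time is 0.5 and the remaining time is 2.5, matching the second entry.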

The UBG algorithm

Based on the upper bound, we propose a new UBG algorithm for early outbreak detection. First we explain the difference between UBG and CELF. The CELF algorithm [24] exploits the submodular property to improve the original greedy algorithm. However, CELF demands N (the number of nodes in the network) remaining time estimations to establish the initial bounds of marginal increments, which is time expensive on large graphs. In contrast, the proposed Upper Bound based Greedy (UBG) algorithm uses the derived new bound to further reduce the number of remaining time estimations, especially in the initialization step. This way, the nodes are ranked by their upper bound scores which can be used to prune unnecessary calculations in the CELF algorithm. We use Example 2 for illustration. We still use the network in Example 1 for explanation. The goal here is to find the top-1 node with maximum remaining time. Obviously, the upper bound of R(), 7.6006, is the largest in the graph. Thus, we use MC simulation to estimate R(), and get R() ≈6.5159. Now, we can observe that 6.5159 is already larger than the upper bounds of R(), R() and R(). Thus, we do not need extra MC simulations to estimate the remaining time of the other three nodes, and R() is the node with the maximal remaining time in the network. □ We summarize the UBG algorithm in Algorithm 2. UBG In Algorithm 2, the column vector, δ = {δ }, denotes upper bounds of marginal increments under the current sensor set S, i.e., δ ≥ R(S ∪ {u}) − R(S) . Before searching for the first node (i.e. S =∅), we estimate an upper bound for each node by Eq. (26). Then, the algorithm proceeds similar to CELF. Note that due to the submodular properties, these upper bounds of marginal increments can be dynamically adjusted by estimation calls, which becomes smaller and smaller with the algorithm carrying on. 
In the algorithm, MC(S) denotes the Monte-Carlo simulations that are used to estimate R(S) for the sensor set S, I(u) = 0 denotes that Monte-Carlo simulations have not yet been run on node u in the current iteration, and I(u) = 1 means that Monte-Carlo simulations have already been run on node u.
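
The control flow of Algorithm 2 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' C++ implementation: the function name `ubg_select`, the callback `estimate_R` (standing in for MC(S)) and the toy coverage function are assumptions introduced here. The upper bounds are passed in as a plain dictionary rather than computed from Eq. (26).

```python
from typing import Callable, Dict, FrozenSet, Hashable, List, Tuple


def ubg_select(
    upper_bounds: Dict[Hashable, float],
    k: int,
    estimate_R: Callable[[FrozenSet[Hashable]], float],
) -> Tuple[List[Hashable], int]:
    """Select k sensors as in Algorithm 2 and count the calls to estimate_R."""
    delta = dict(upper_bounds)   # delta[u] >= R(S + {u}) - R(S) for the current S
    selected: List[Hashable] = []
    current_R = 0.0              # R(empty set) = 0
    calls = 0
    for _ in range(min(k, len(delta))):
        evaluated = set()        # nodes with I(u) = 1 in this iteration
        while True:
            u = max(delta, key=delta.get)                      # row 07
            if u not in evaluated:                             # rows 08-11
                delta[u] = estimate_R(frozenset(selected + [u])) - current_R
                calls += 1
                evaluated.add(u)
            if all(delta[u] >= d for v, d in delta.items() if v != u):  # row 12
                current_R += delta[u]                          # row 13
                selected.append(u)                             # row 14
                del delta[u]
                break
    return selected, calls


# Hypothetical toy objective: R(S) = number of items covered by S (submodular).
COVER = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {6}, "d": {1}}


def toy_R(S: FrozenSet[Hashable]) -> float:
    return float(len(set().union(*(COVER[v] for v in S))))


# Hypothetical upper bounds, each at least R({v}).
toy_bounds = {"a": 4.5, "b": 3.5, "c": 1.5, "d": 1.5}
print(ubg_select(toy_bounds, 1, toy_R))
```

In this toy run the first sensor is found with a single call to `estimate_R`, because the estimate for node "a" already exceeds every other upper bound. CELF would first evaluate all four nodes.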

Accelerations of UBG

Even with the UBG algorithm proposed in Section 5, the running time can still be prohibitive on large social networks. Although UBG greatly reduces the number of remaining time estimation calls, each estimation call is very time-consuming, as it needs to produce enough samples and average them. A possible way to further accelerate UBG is therefore to employ heuristics that approximate the remaining time R(S) or the marginal return R(S ∪ {u}) − R(S) in UBG. Following this idea, we introduce two accelerated algorithms that address the inefficiency of UBG: Quickest-Path-UBG and Local-Reduction-UBG. The experimental results will show that (i) both are efficient in terms of running time, and (ii) in terms of detection time, the former is better suited to sparse networks, while the latter is more effective in dense networks.

Quickest-Path-UBG

In this part we introduce a tractable heuristic to approximate the remaining time R(S). From Eq. (5) and Eq. (10), the key point of estimating R(S) is how to estimate the expected detection time E[τ(u, S)], which measures the expected time delay of a diffusion initiated from node u propagating to the sensor set S. An intuitive idea is that the most likely propagation path should be the quickest path from node u to the set S. The question is then how to measure the quickest path from node u to the set S, and the proposed Quickest-Path-UBG is inspired by answering it. If a random variable X is geometrically distributed with parameter p, then E[X] = 1/p. Since the time that a node u spends infecting an uninfected neighbor v is geometrically distributed with parameter ip(u, v), the value 1/ip(u, v) is the expected time cost for an infection to propagate from node u to node v along the edge (u, v). Based on this observation, we label the graph G = (V, E) with a time cost function m : E → [1, ∞) as follows: m(u, v) = 1/ip(u, v) for each edge (u, v) ∈ E. Fig. 2 shows an example.
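
The graph conversion above can be sketched as follows. The helper names `time_costs` and `geometric_mean_partial` are hypothetical, and the second function only checks numerically that the mean of a geometric variable is 1/p by summing the series k · p · (1 − p)^(k−1).

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def time_costs(ip: Dict[Edge, float]) -> Dict[Edge, float]:
    """Map each infection probability ip(u, v) in (0, 1] to the time cost 1 / ip(u, v)."""
    for edge, p in ip.items():
        if not 0.0 < p <= 1.0:
            raise ValueError(f"infection probability of {edge} must lie in (0, 1]")
    return {edge: 1.0 / p for edge, p in ip.items()}


def geometric_mean_partial(p: float, terms: int = 10_000) -> float:
    """Partial sum of sum_k k * p * (1 - p)**(k - 1), which converges to 1 / p."""
    return sum(k * p * (1.0 - p) ** (k - 1) for k in range(1, terms + 1))


# Hypothetical two-edge graph, for illustration only.
costs = time_costs({("a", "b"): 0.2, ("a", "c"): 0.1})
print(costs)
print(geometric_mean_partial(0.2))
```

Because every probability is at most 1, every cost is at least 1, which matches the codomain [1, ∞) of m.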

Fig. 2

An illustration of graph conversion.

For any two nodes u and v, let d(u, v) be the smallest time cost among the paths connecting u and v. For example, for the pair of nodes considered in Fig. 2, d = min{5 + 3.3, 10 + 5} = 8.3. Then, for a subset S ⊆ V, we define d(u, S) = min_{v ∈ S} d(u, v). Intuitively, d(u, S) denotes the expected time that an infection propagates from u to S along the quickest path in the graph G = (V, E, m). A fundamental question therefore arises: is there an approximate relationship between the detection time and the shortest distance?

Theorem 6. When the shortest path from u to S in the graph G = (V, E, m) is unique, the quantity d(u, S) ∧ T_max can be used to approximate the expected detection time, i.e., E[τ(u, S)] ≈ d(u, S) ∧ T_max.

The approximation in the proof borrows a result from probability theory about a family of random variables in which one variable (indexed by i_0) is compared, in the sense of distribution, with each of the others (i = 1, …, n with i ≠ i_0). □

Based on Theorem 6, combining Eq. (5), Eq. (10) and Eq. (28), we can approximate the remaining time R(S ∪ {u}) by ∑_{v∈V} π(v) [T_max − d(v, S ∪ {u}) ∧ T_max] rather than by the heavy Monte-Carlo estimation MC(S ∪ {u}) in row 09 of the UBG algorithm. We call this new method Quickest-Path-UBG, which is shown explicitly in Algorithm 3. Note that we call this acceleration Quickest-Path-UBG rather than Shortest-Path-UBG because it introduces a new measurement that reflects the infection time cost rather than the distance in the usual sense.
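
A minimal sketch of the quickest-path approximation is given below. It assumes the time cost 1/ip(u, v) on each edge and the approximation of R(S) used in row 09 of Algorithm 3. It obtains d(v, S) for all v in one pass with a multi-source Dijkstra search on the reversed graph, which is a design choice of this sketch rather than something stated in the paper. The four-node graph and its probabilities are hypothetical, chosen so that the two path costs are 5 + 3.3 and 10 + 5.

```python
import heapq
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def distances_to_set(ip: Dict[Edge, float], sensors: Set[str]) -> Dict[str, float]:
    """Return d(v, S) for every node v that can reach the sensor set S."""
    reverse: Dict[str, list] = {}
    for (u, v), p in ip.items():
        reverse.setdefault(v, []).append((u, 1.0 / p))  # edge cost m(u, v) = 1 / ip(u, v)
    dist = {s: 0.0 for s in sensors}
    heap = [(0.0, s) for s in sensors]
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, cost in reverse.get(v, []):
            if d + cost < dist.get(u, float("inf")):
                dist[u] = d + cost
                heapq.heappush(heap, (d + cost, u))
    return dist


def approx_remaining_time(ip: Dict[Edge, float], prior: Dict[str, float],
                          sensors: Set[str], t_max: float) -> float:
    """Approximate R(S) by sum_v pi(v) * (T_max - min(d(v, S), T_max))."""
    dist = distances_to_set(ip, sensors)
    return sum(pi * (t_max - min(dist.get(v, float("inf")), t_max))
               for v, pi in prior.items())


# Hypothetical example: two paths from "a" to "d".
ip = {("a", "b"): 0.2, ("b", "d"): 0.3, ("a", "c"): 0.1, ("c", "d"): 0.2}
prior = {v: 0.25 for v in "abcd"}
dist = distances_to_set(ip, {"d"})
print(round(dist["a"], 1))
print(approx_remaining_time(ip, prior, {"d"}, 30.0))
```

Nodes that cannot reach any sensor get distance infinity and therefore contribute zero remaining time.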

Local-Reduction-UBG

In this part we turn to a tractable heuristic to approximate the marginal return δ_u(S) := R(S ∪ {u}) − R(S) in row 09 of the UBG algorithm. To facilitate the description, we approximate the marginal reduction D(S) − D(S ∪ {u}) rather than the marginal return R(S ∪ {u}) − R(S); the two coincide because R(S) = T_max − D(S).

In the sociology literature, degree and other centrality-based heuristics are commonly used to select influential nodes in social networks [37]. These methods usually overlook the multi-step complexity of network diffusion and assume that an infection propagates only one hop before it ends. In other words, if a node initiates an infection, the diffusion reaches at most its first-order neighbors. We call this the one-hop assumption hereafter. In particular, experimental results in [24] showed that selecting vertices with maximum in-degree as sensors results in earlier infection detection than other heuristics, which supports the one-hop assumption to some extent. Based on this observation, we propose a tractable method under the one-hop assumption to approximate the marginal reduction D(S) − D(S ∪ {u}).

We first recall a useful result from probability theory: if the random variables X_i are geometrically distributed with parameters p_i for i = 1, ⋯, n and are independent, then X_1 ∧ ⋯ ∧ X_n is geometrically distributed with parameter 1 − ∏_{i=1}^{n} (1 − p_i), and hence E[X_1 ∧ ⋯ ∧ X_n] = 1/(1 − ∏_{i=1}^{n} (1 − p_i)). According to the DSI model, the time that an infected node u spends infecting an uninfected neighbor v is geometrically distributed with parameter ip(u, v). If the initially infected node has several neighbors in the sensor set S, the infections of different neighbors are independent. Under the one-hop assumption, the expected detection time of the initially infected node can thus be calculated by Eq. (29). If the initially infected node has no neighbors in the sensor set S, its expected detection time is defined as T_max. If the initially infected node itself belongs to S, its expected detection time is 0. Fig. 3 presents an example to show the calculation result of the expected detection time for each node under the one-hop assumption.
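
The one-hop calculation can be sketched as below. The function names and the small graph are hypothetical. Capping the value at T_max is an assumption of this sketch, since the exact form of Eq. (29) is not reproduced here.

```python
from math import prod
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def min_geometric_parameter(probs: Iterable[float]) -> float:
    """Parameter of the minimum of independent geometric variables with parameters probs."""
    return 1.0 - prod(1.0 - p for p in probs)


def one_hop_detection_time(v: str, sensors: Set[str],
                           ip: Dict[Edge, float], t_max: float) -> float:
    """Expected detection time of initiator v under the one-hop assumption."""
    if v in sensors:
        return 0.0                      # v is itself a sensor
    probs = [p for (a, b), p in ip.items() if a == v and b in sensors]
    if not probs:
        return t_max                    # no sensor among the out-neighbors of v
    # Assumption of this sketch: the expectation is capped at T_max.
    return min(1.0 / min_geometric_parameter(probs), t_max)


# Hypothetical graph with sensor set {a, b} and T_max = 30.
ip = {("c", "a"): 0.1, ("d", "a"): 0.5, ("d", "b"): 0.5, ("e", "c"): 0.3}
sensors = {"a", "b"}
for node in "abcde":
    print(node, one_hop_detection_time(node, sensors, ip, 30.0))
```

Node "d" has two sensor neighbors, so its parameter is 1 − 0.5 · 0.5 = 0.75 and its expected detection time is 1/0.75.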

Fig. 3

An example to show the calculation results of expected detection times under the one-hop assumption. Here the seed set S = {a, b} is selected as the sensor set for monitoring and T_max is assigned to be larger than 10. The red boldfaced number on the left of each node is the expected detection time of that node calculated by Eq. (29).

With the preparations above, we now calculate the marginal reduction δ_u(S) = D(S) − D(S ∪ {u}) under the one-hop assumption. Assume the set S has been selected as the sensor set to detect infections. When another node u is added to S as a new sensor, the reduction of the expected detection time is given by Eq. (30), which follows from Eq. (5). We now use Theorem 7 to present a concrete expression for the right-hand side of Eq. (30). Similar to the definition in Eq. (1), we predefine the parents Par(S) of a set S under the one-hop assumption, and we write f(u, S) for the one-hop expected detection time of node u under the sensor set S.

Theorem 7. In the DSI model with infection probability ip(u, v) on a directed graph G = (V, E), the reduction δ_u(S) of the expected detection time can be calculated by Eq. (32).

Proof. Under the one-hop assumption, the reduction δ_u(S) of the expected detection time incurred by adding u to the sensor set S consists of three parts: (a) u itself, reduced from f(u, S) to 0; (b) each node v among the remaining non-detected parents of u, reduced from T_max to f(v, S ∪ {u}); and (c) each node v in the intersection of the parents of u and the parents of S, reduced from f(v, S) to f(v, S ∪ {u}). In all other cases, the expected detection time does not change. Hence Eq. (32) follows. □

We call the method Local-Reduction-UBG because we only consider the local change in monitoring {u} and Par(u) ∖ Par(S) to approximate the marginal reduction D(S) − D(S ∪ {u}), which corresponds directly to the one-hop assumption. For the DSI model with small propagation probabilities pp and time horizon T_max, the assumption is reasonable. We now return to the UBG algorithm. We can approximate the marginal return δ_u(S) := R(S ∪ {u}) − R(S) = D(S) − D(S ∪ {u}) by Eq. (32) to avoid the heavy Monte-Carlo estimation in row 09 of the UBG algorithm. We call this new method Local-Reduction-UBG, which is shown explicitly in Algorithm 4.
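
The locality argument of the proof can be illustrated with a short sketch. Rather than the closed form of Eq. (32), it sums the change in one-hop detection time over u and the parents of u only, and the usage lines compare this with the brute-force difference D(S) − D(S ∪ {u}). The helper names, the cap at T_max and the example graph are assumptions of this sketch.

```python
from math import prod
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def f(v: str, sensors: Set[str], ip: Dict[Edge, float], t_max: float) -> float:
    """One-hop expected detection time of initiator v (capped at T_max by assumption)."""
    if v in sensors:
        return 0.0
    probs = [p for (a, b), p in ip.items() if a == v and b in sensors]
    if not probs:
        return t_max
    return min(1.0 / (1.0 - prod(1.0 - p for p in probs)), t_max)


def one_hop_D(sensors: Set[str], ip: Dict[Edge, float],
              prior: Dict[str, float], t_max: float) -> float:
    """One-hop detection time D(S) = sum_v pi(v) * f(v, S)."""
    return sum(pi * f(v, sensors, ip, t_max) for v, pi in prior.items())


def local_reduction(u: str, sensors: Set[str], ip: Dict[Edge, float],
                    prior: Dict[str, float], t_max: float) -> float:
    """D(S) - D(S + {u}) evaluated only on u and the parents of u."""
    affected = {u} | {a for (a, b) in ip if b == u}
    bigger = sensors | {u}
    return sum(prior[v] * (f(v, sensors, ip, t_max) - f(v, bigger, ip, t_max))
               for v in affected)


# Hypothetical graph: current sensor set {a}, candidate sensor "u".
ip = {("c", "a"): 0.1, ("c", "u"): 0.2, ("d", "u"): 0.5, ("u", "a"): 0.4, ("e", "d"): 0.3}
prior = {v: 0.2 for v in ["a", "c", "d", "e", "u"]}
local = local_reduction("u", {"a"}, ip, prior, 30.0)
brute = one_hop_D({"a"}, ip, prior, 30.0) - one_hop_D({"a", "u"}, ip, prior, 30.0)
print(local, brute)
```

Only three of the five nodes are touched by the local computation, yet it agrees with the full recomputation, which is why the approximation is cheap per candidate.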

Experiments

We conduct experiments on both synthetic and real-world data sets to evaluate the UBG algorithm, the Quickest-Path-UBG algorithm and Local-Reduction-UBG algorithm. We implement the algorithms using C++ with the Standard Template Library (STL). All experiments are run on a Linux (Ubuntu 11.10) machine with six-core 1400 MHz AMD CPU and 32 GB memory.

Data Sets

Three real and one synthetic data sets are used for comparison. The Digger data¹ is a heterogeneous network including Digg stories, user actions (submit, digg, comment and reply) with respect to the stories, and friendship relations among users. The Twitter and Epinions data sets can both be obtained from Stanford Datasource². Epinions is a general consumer review site where visitors can read reviews about a variety of items to help them decide on a purchase. The synthetic Small-world data set is a type of graph in which each node can be reached in a small number of hops. For the small-world model we set the number of nearest neighbors to k = 15 and the rewiring probability to p = 0.1. These networks are representative, covering a variety of relation types and sizes. The details of the data sets are listed in Table 2, where degree means in-degree. In our experiments, an undirected graph is regarded as a bidirectional graph.

Table 2

Statistics of the four networks (three real-world and one synthetic).

Dataset	Digger	Twitter	Epinions	Small-world
#Node	8194	32,986	51,783	200,000
#Edge	56,440	763,713	476,491	3,000,000
Average degree	6.9	23.2	9.2	15.0
Maximal degree	850	674	190	29


Benchmark methods

We compare UBG (Algorithm 2), Quickest-Path-UBG (Algorithm 3) and Local-Reduction-UBG (Algorithm 4) with both greedy and heuristic algorithms.

CELF [24]. The state-of-the-art greedy algorithm, which uses 10,000 snapshots in the whole process for any network.

DEGREE [37]. A heuristic algorithm based on “degree centrality”, which treats high-degree nodes as key ones. The seeds are the k nodes with the highest in-degrees.

INTER-MONITOR DISTANCE [35]. A heuristic algorithm which requires any pair of sensors to be at least d hops apart, where d is as large as possible while still allowing k monitors to be chosen.

PageRank [5]. A link analysis algorithm which ranks the importance of pages in a Web graph. We implement the power method with a damping factor of 0.85, and pick the k highest-ranked nodes as seeds.

Random. It simply selects k random vertices in the graph as the seed set and is taken as the baseline.

In our experiments, to obtain the detection time of the sensor sets provided by the heuristic algorithms, we run Monte-Carlo simulation on the networks 10,000 times and calculate the mean. The simple greedy algorithm is not compared because many works have reported that CELF achieves the same optimization result with less running time. Since the DEGREE heuristic is the state of the art [37], we do not implement heuristics such as distance centrality and betweenness centrality-based heuristics.
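
For concreteness, the PageRank and DEGREE baselines can be sketched as follows. This is a plain power iteration with damping factor 0.85. The uniform redistribution of rank from nodes without out-links, the tolerance and the function names are assumptions of this sketch, since the text does not specify them.

```python
from typing import Dict, Iterable, List, Tuple


def pagerank(nodes: List[str], edges: Iterable[Tuple[str, str]],
             damping: float = 0.85, tol: float = 1e-10,
             max_iter: int = 200) -> Dict[str, float]:
    """Power-method PageRank; rank of dangling nodes is spread uniformly (assumption)."""
    n = len(nodes)
    out: Dict[str, List[str]] = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
        converged = sum(abs(new[v] - rank[v]) for v in nodes) < tol
        rank = new
        if converged:
            break
    return rank


def in_degrees(nodes: List[str], edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Score of the DEGREE heuristic: number of incoming edges."""
    deg = {v: 0.0 for v in nodes}
    for _, v in edges:
        deg[v] += 1.0
    return deg


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring nodes."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical star graph in which every other node points to "a".
nodes = ["a", "b", "c", "d"]
edges = [("b", "a"), ("c", "a"), ("d", "a")]
ranks = pagerank(nodes, edges)
print(top_k(ranks, 1), top_k(in_degrees(nodes, edges), 1))
```

Both heuristics pick node "a" on this toy graph. On real networks they generally produce different seed sets.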

Parameter Setting

We mainly report results for a uniform infection probability of 0.1 assigned to each directed link in the network, i.e., ip(u, v) = 0.1 for any directed edge (u, v) ∈ E. One can refer to [14], [33], [36] for learning real values of the parameters from available data. Besides, we set the time horizon T_max = 30 and let the prior distribution Π be uniform over the network.

Results

Evaluations of Monte-Carlo methods

In Section 3, we proposed two Monte-Carlo methods to estimate D(S). Table 3 shows the estimation results of Propagation Simulation and Snapshot Simulation. Here we select the 10 nodes with the highest in-degrees in each network as the sensor set S. Propagation Simulation and Snapshot Simulation yield almost the same estimates, which confirms their equivalence in estimating D(S). Fig. 4 shows the cumulative time cost of these two Monte-Carlo methods. Here we randomly select 10 sensor sets {S_1, S_2, …, S_10} from the Digger data, and every sensor set has five sensor nodes, i.e., |S_i| = 5 for all i = 1, 2, …, 10. The cumulative time cost of Propagation Simulation increases linearly, while that of Snapshot Simulation has a big jump at the first sensor set and then increases slowly. The reason is that Snapshot Simulation needs to establish numerous snapshots to estimate the first D(S_1), and these snapshots can be reused in the subsequent estimations of D(S_2), …, D(S_10).
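
The two estimators can be sketched on a toy graph as follows. The sketch assumes the DSI dynamics described earlier: at every time step each infected node tries to infect each uninfected out-neighbor v with probability ip(u, v), and the detection time is capped at T_max. All function names are hypothetical. For brevity the snapshots are regenerated inside `estimate_D` instead of being cached across sensor sets, so this sketch does not show the reuse that makes Snapshot Simulation cheaper in Fig. 4.

```python
import heapq
import math
import random
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]
Adj = Dict[str, List[Tuple[str, float]]]


def adjacency(ip: Dict[Edge, float]) -> Adj:
    out: Adj = {}
    for (a, b), p in ip.items():
        out.setdefault(a, []).append((b, p))
    return out


def propagation_time(u: str, sensors: Set[str], out: Adj, t_max: int,
                     rng: random.Random) -> int:
    """Simulate one cascade from u step by step; return min(detection step, T_max)."""
    infected = {u}
    t = 0
    while t < t_max and not (infected & sensors):
        t += 1
        infected |= {b for a in infected for b, p in out.get(a, [])
                     if b not in infected and rng.random() < p}
    return t


def sample_geometric(p: float, rng: random.Random) -> int:
    """Number of Bernoulli(p) trials up to and including the first success."""
    if p >= 1.0:
        return 1
    return max(1, math.ceil(math.log(1.0 - rng.random()) / math.log(1.0 - p)))


def snapshot_time(u: str, sensors: Set[str], weighted: Adj, t_max: int) -> int:
    """Shortest weighted distance from u to the sensors in one snapshot, capped at T_max."""
    dist = {u: 0}
    heap = [(0, u)]
    while heap:
        d, a = heapq.heappop(heap)
        if d > dist[a]:
            continue
        if d >= t_max:
            return t_max
        if a in sensors:
            return d
        for b, w in weighted.get(a, []):
            if d + w < dist.get(b, math.inf):
                dist[b] = d + w
                heapq.heappush(heap, (d + w, b))
    return t_max


def estimate_D(prior: Dict[str, float], sensors: Set[str], ip: Dict[Edge, float],
               t_max: int, runs: int, seed: int, method: str) -> float:
    """Monte-Carlo estimate of D(S); method is 'propagation' or 'snapshot'."""
    rng = random.Random(seed)
    out = adjacency(ip)
    total = 0.0
    for _ in range(runs):
        if method == "snapshot":
            snap = {a: [(b, sample_geometric(p, rng)) for b, p in nbrs]
                    for a, nbrs in out.items()}
            total += sum(pi * snapshot_time(u, sensors, snap, t_max)
                         for u, pi in prior.items())
        else:
            total += sum(pi * propagation_time(u, sensors, out, t_max, rng)
                         for u, pi in prior.items())
    return total / runs


# Hypothetical single-edge graph: the exact value is 0.5 * 2 + 0.5 * 0 = 1 (up to the cap).
ip = {("u", "s"): 0.5}
prior = {"u": 0.5, "s": 0.5}
d_prop = estimate_D(prior, {"s"}, ip, 30, 20000, 1, "propagation")
d_snap = estimate_D(prior, {"s"}, ip, 30, 20000, 2, "snapshot")
print(d_prop, d_snap)
```

Both estimates should be close to 1 on this example, in line with the agreement reported in Table 3.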

Table 3

Comparison of two Monte-Carlo methods on estimating D(S).

Comparison item	Digger	Twitter	Epinions	Small-world
Propagation simulation	10.056	12.236	13.216	13.738
Snapshot simulation	10.053	12.240	13.212	13.737

Fig. 4

The cumulative time cost in estimating detection time D(S) of ten different sensor sets.


Evaluations of the upper bounds

Table 4 shows the gap between the real value of the remaining time R(S) and its upper bounds. Here we again select the ten nodes with the highest in-degrees in each network as the sensor set S. The real value of the remaining time R(S) is obtained by Propagation Simulation. Upper Bound (I) is the upper bound presented in Eq. (21), and Upper Bound (II) is the upper bound presented in Eq. (26). The experimental results show that the real value is close to Upper Bound (I) and Upper Bound (II) on all four data sets, which in turn supports the validity of the proposed upper bounds in Eqs. (21) and (26).

Table 4

Evaluations of the upper bounds.

Comparison item	Digger	Twitter	Epinions	Small-world
Real value	10.056	12.236	13.216	13.738
Upper Bound (I)	11.437	12.820	13.768	14.431
Upper Bound (II)	11.739	13.026	14.271	14.609


Number of Estimation calls

In Table 5, we compare the number of estimation or approximation calls in the first 10 iterations among CELF, UBG, Quickest-Path-UBG and Local-Reduction-UBG on the four data sets. The number of calls in UBG and its two accelerations is significantly reduced compared to CELF, especially in the first round. In some individual iterations CELF needs fewer calls than our methods, but the total number of calls of our methods is much smaller than that of CELF. As listed in Table 5, the total number of calls in the first 10 iterations of UBG and its two accelerations is at least 94% lower than that of CELF on the four data sets. Similarly, a reduction of at least 81% relative to CELF is observed in the first 50 iterations. From these observations, we conclude that UBG is more efficient than CELF on large networks.

Table 5

The number of estimation calls at the first ten iterations.

Datasets	Algorithms	1	2	3	4	5	6	7	8	9	10	Sum(1:10)	Sum(1:50)
Digger	CELF	8194	14	22	32	55	38	19	38	17	28	8457	10237
	UBG	67	52	23	9	41	38	22	38	82	52	424	1894
	Quickest-Path-UBG	71	56	24	21	42	54	32	29	79	56	464	1923
	Local-Reduction-UBG	56	49	34	32	59	34	34	39	68	44	458	1892

Twitter	CELF	32,986	323	121	28	18	78	67	38	98	82	33839	36262
	UBG	448	31	23	179	112	152	36	251	134	97	1463	3256
	Quickest-Path-UBG	452	29	28	147	149	141	31	302	135	102	1516	3672
	Local-Reduction-UBG	534	43	30	134	183	147	45	312	138	138	1704	4387

Epinions	CELF	51,783	216	371	102	98	46	29	15	12	115	52787	56213
	UBG	437	193	87	227	169	161	82	134	120	136	1746	4418
	Quickest-Path-UBG	467	203	91	201	159	174	83	142	118	159	1797	4821
	Local-Reduction-UBG	523	213	109	235	178	182	92	156	135	182	2005	6236

Small-world	CELF	200,000	1092	346	167	389	254	138	76	275	146	202883	213782
	UBG	2367	723	452	431	217	319	231	373	78	267	5458	14320
	Quickest-Path-UBG	3423	872	421	543	231	341	245	401	102	231	6810	15689
	Local-Reduction-UBG	2345	687	328	341	238	347	277	353	87	356	5359	13940
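
The reduction rates quoted in the text can be recomputed from the Sum(1:10) and Sum(1:50) columns of Table 5. The short script below copies those reported sums as given and returns the smallest reduction relative to CELF over all data sets and UBG variants. The dictionary layout and function name are choices made for this sketch.

```python
from typing import Dict, Tuple

# (Sum(1:10), Sum(1:50)) as reported in Table 5.
SUMS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "Digger": {"CELF": (8457, 10237), "UBG": (424, 1894),
               "Quickest-Path-UBG": (464, 1923), "Local-Reduction-UBG": (458, 1892)},
    "Twitter": {"CELF": (33839, 36262), "UBG": (1463, 3256),
                "Quickest-Path-UBG": (1516, 3672), "Local-Reduction-UBG": (1704, 4387)},
    "Epinions": {"CELF": (52787, 56213), "UBG": (1746, 4418),
                 "Quickest-Path-UBG": (1797, 4821), "Local-Reduction-UBG": (2005, 6236)},
    "Small-world": {"CELF": (202883, 213782), "UBG": (5458, 14320),
                    "Quickest-Path-UBG": (6810, 15689), "Local-Reduction-UBG": (5359, 13940)},
}


def worst_reduction(sums: Dict[str, Dict[str, Tuple[int, int]]], column: int) -> float:
    """Smallest relative reduction in calls versus CELF (column 0: first 10, 1: first 50)."""
    return min(
        1.0 - calls[column] / rows["CELF"][column]
        for rows in sums.values()
        for name, calls in rows.items()
        if name != "CELF"
    )


print(f"first 10 iterations: {worst_reduction(SUMS, 0):.1%}")
print(f"first 50 iterations: {worst_reduction(SUMS, 1):.1%}")
```

In both columns the smallest reduction comes from Quickest-Path-UBG on Digger.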


Detection time

Detection time measures the time delay of a message propagated from a diffusion source to a sensor. We run tests on the four data sets and obtain detection time results w.r.t. the parameter k (the number of sensors), where k increases from 1 to 50, as shown in Fig. 5. UBG, as an updated version of CELF, achieves competitive results on the four data sets. More importantly, the detection times of UBG and CELF are identical in the four figures, which again confirms that UBG and CELF select the same sensors. The only difference between UBG and CELF is the number of remaining time estimation calls.

Fig. 5

Detection time w.r.t. the number of sensors k on the four data sets.

Quickest-Path-UBG and Local-Reduction-UBG always perform better than the other heuristics. More specifically, on sparse networks such as Digger and Epinions, Quickest-Path-UBG is more competitive; on dense networks such as Twitter and Small-world, Local-Reduction-UBG is more competitive.

Selection time

Selection time measures the time an algorithm takes to select sensors. Fig. 6 shows the time cost of selecting sensors with k = 50. UBG is 4-8 times faster than CELF. One may argue that such a modest improvement can be neglected in large networks. In fact, UBG scales well to large networks: as the size of a network increases, Monte-Carlo simulations take more time, and thus UBG achieves better performance by pruning more unnecessary Monte-Carlo simulations.

Fig. 6

The selection time of the algorithms.

As for the heuristics, Degree and Random are very fast in selecting candidate nodes, taking less than 1 second. Quickest-Path-UBG and Local-Reduction-UBG are attractive choices because of their good performance in detection time. PageRank and Inter-Monitor Distance are slightly slower and less desirable in view of their poor performance in detection time.

Sensitivity analysis

We run tests on the Twitter data set and obtain the selection time w.r.t. the parameter ip (the infection probability). In the experiments, ip increases from 0.1 to 0.5, and we assign a uniform infection probability ip to each directed link under the DSI model. From the results in Fig. 7, we can conclude that CELF becomes more time-consuming as ip grows. By contrast, UBG and its accelerations are robust and insensitive to the parameter ip.

Fig. 7

Selection time of different greedy algorithms w.r.t. the infection probability, with seed size k = 50.

Some detailed proofs

Proof of Proposition 1 in Section 2.2

For simplicity, we denote the set {t ≥ 0 : I_t ∩ S ≠ ∅ with I_0 = {u}} by A(u, S) and denote inf A(u, S) by T(u, S). Then the definition in Eq. (2) can be rewritten as τ(u, S) = T(u, S) ∧ T_max.

1) If u ∈ S and I_0 = {u}, then I_0 ∩ S ≠ ∅, which means 0 ∈ A(u, S). Hence τ(u, S) = 0 by the definition in Eq. (2).

2) First, we prove A(u, S_1 ∪ S_2) = A(u, S_1) ∪ A(u, S_2). For any t ∈ A(u, S_1 ∪ S_2), I_t ∩ (S_1 ∪ S_2) ≠ ∅ holds. From I_t ∩ (S_1 ∪ S_2) = (I_t ∩ S_1) ∪ (I_t ∩ S_2) ≠ ∅, we know that either I_t ∩ S_1 or I_t ∩ S_2 is nonempty, which implies t ∈ A(u, S_1) ∪ A(u, S_2). Hence A(u, S_1 ∪ S_2) ⊆ A(u, S_1) ∪ A(u, S_2) holds. Conversely, for any t ∈ A(u, S_1) ∪ A(u, S_2), t lies in A(u, S_1) or A(u, S_2). Without loss of generality, we assume t ∈ A(u, S_1), which means I_t ∩ S_1 ≠ ∅. Since S_1 ⊂ S_1 ∪ S_2, I_t ∩ (S_1 ∪ S_2) is nonempty, which indicates t ∈ A(u, S_1 ∪ S_2). Hence A(u, S_1) ∪ A(u, S_2) ⊆ A(u, S_1 ∪ S_2) also holds. Second, we prove T(u, S_1 ∪ S_2) = T(u, S_1) ∧ T(u, S_2). Taking the infimum on both sides of the first-step conclusion gives inf A(u, S_1 ∪ S_2) = inf (A(u, S_1) ∪ A(u, S_2)). Since all these sets are composed of integers, we obtain T(u, S_1 ∪ S_2) = T(u, S_1) ∧ T(u, S_2). Finally, we take the minimum with T_max on both sides of the second-step conclusion, i.e., T(u, S_1 ∪ S_2) ∧ T_max = T(u, S_1) ∧ T(u, S_2) ∧ T_max, which means τ(u, S_1 ∪ S_2) = τ(u, S_1) ∧ τ(u, S_2) by the definition of τ.

3) Similarly, in order to prove this property, we only need to prove that, for any t ≥ 0, the probability that t lies in the set defined for one node as initiator equals the probability that t lies in the set obtained when the roles of the two nodes are exchanged. For any instance in which t lies in the former set, as the graph is undirected, we can regard the other node as the initiator and reverse the process, so in these instances t also lies in the latter set. Since there may be other instances in which t lies in the latter set, the probability of the former event is at most that of the latter. Conversely, the reverse inequality can be verified in the same way. Therefore the two probabilities are equal, from which the property follows.

4) If S ⊂ T, in order to prove this property, we only need to prove A(u, S) ⊆ A(u, T) and notice that the infimum of a larger set is smaller. For any t ∈ A(u, S), I_t ∩ S ≠ ∅ holds. Besides, as S ⊂ T, I_t ∩ S ≠ ∅ implies I_t ∩ T ≠ ∅, which means t is also in A(u, T). So A(u, S) ⊆ A(u, T) is verified, from which the property follows.

5) If A(u, S) is empty, then T(u, S) = +∞ and τ(u, S) = +∞ ∧ T_max = T_max ∈ [0, T_max]. If A(u, S) is nonempty, then every element in it is at least 0 and τ(u, S) = T(u, S) ∧ T_max ≤ T_max. Therefore, we can conclude 0 ≤ τ(u, S) ≤ T_max.

Proof of Theorem 1 in Section 3

We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9], denoted by (A1).

(A1): Let E be a set of index pairs (i, j), and let X_i, Y_j, (i, j) ∈ E, be pairwise distinct variables. We define a formula F := ∨ {X_i ∧ Y_j : (i, j) ∈ E}, which is a disjunction of all the conjunctive clauses X_i ∧ Y_j, (i, j) ∈ E. How many valuations satisfy the formula F?

Then we introduce the following problem (A2).

(A2): G = (V, E) is a directed graph, where each edge e ∈ E is associated with a weight that is geometrically distributed with a parameter p. s, t are two nodes in G and k is a positive integer. What is the probability of the event that the shortest path from s to t has length at most k?

Now we prove that (A1) is reducible to (A2). We map the variable X_i to a node x_i and the variable Y_j to a node y_j. If (i, j) exists in E, then we add an edge from x_i to y_j. Also, we add a source node s and an edge (s, x_i), associated with a weight that is geometrically distributed with parameter 1/2, for any i ∈ {i : (i, j) ∈ E}. Thus, this weight equals 1 with probability 1/2 and is strictly greater than 1 with probability 1/2. Similarly, we add a terminal node t and an edge from each y_j to t with a weight following the same distribution. So we construct a graph with geometrically distributed weights on edges. Now we can build a probability-preserving bijection between the valuations of X_i, Y_j and the subgraphs of the constructed random graph: for a valuation ν, the corresponding subgraph is the one where each edge adjacent to x_i has length 1 iff ν(X_i) = 1. The same argument applies to Y_j. We set k = 3 and claim that the probability in (A2) is the number of valuations satisfying F divided by 2^N, where N is the number of variables. Indeed, this is based on the following observation: a valuation satisfies F iff some adjacent pair X_i and Y_j is true, iff the incident edges to the corresponding x_i and y_j have length 1 in the corresponding subgraph. Note that any edge of length greater than 1 is irrelevant: the structure of the graph ensures that it can never be part of a path of length 3 from s to t, since any path from s to t must make three jumps: from s to some x_i, from x_i to some y_j, and from y_j to t. Thus (A1) is reducible to (A2), which implies that (A2) is #P-hard.

We aim to solve the problem (denoted by (A)) of calculating the detection time under the DSI model. Note that and . Hence (A2) is reducible to (A), and (A) is #P-hard. Therefore, computing the detection time under the DSI model is #P-hard.

Proof of Theorem 2 in Section 3

In order to prove that the random variables T(u, S) and c(u, S) are identically distributed, we need to prove P(T(u, S) = t) = P(c(u, S) = t) for any t > 0. On the one hand, we can prove P(T(u, S) = t) ≤ P(c(u, S) = t). Given t > 0, we find that for any simulation with T(u, S) = t, a snapshot with c(u, S) = t can be constructed. By definition, if T(u, S) = t, then J_i ∩ S = ∅ for i = 0, 1, ⋯, t − 1 and J_t ∩ S ≠ ∅. For any node , suppose that is infected through the edge with . Denote n = inf{j : s ∈ J_j}. Then we let the edge take weight . For any node s′ ∈ J , we denote , i.e. . Then for any , let the edge take weight , where C is sampled from a geometric distribution. In addition, we take a weight sampled from a geometric distribution on each of the remaining edges. With these operations, a snapshot with c(u, S) = t is constructed, which implies P(T(u, S) = t) ≤ P(c(u, S) = t). On the other hand, we can also get P(c(u, S) = t) ≤ P(T(u, S) = t). For any snapshot with c(u, S) = t, a simulation with T(u, S) = t can be constructed. We let node u be the initiator, i.e., J_0 = {u}, and denote the smallest number in the set by n_0. Meanwhile, we denote the set of nodes by . Let . Complementally, we define J_i = J_0 for all i = 0, ⋯, n_0 − 1. Denote the smallest number in the set by n and the set of nodes by . Let . Complementally, we define J : = J for all i = t, ⋯, t + n − 1. We now can construct a simulation of the diffusion model . Hence, we have a simulation with T(u, S) = t, which implies P(c(u, S) = t) ≤ P(T(u, S) = t). Therefore, we have P(T(u, S) = t) = P(c(u, S) = t) for any t > 0, which implies that T(u, S) and c(u, S) are identically distributed. Hence, the propagation simulation and the snapshot simulation are equivalent in estimating D(S).

Conclusions and further works

In this paper, we discussed scalable solutions to the detection time minimization problem for early detection of dynamic harmful cascades in networks. To address this problem, we proposed an upper bound based greedy (UBG) algorithm and two accelerated variants to cater for larger-scale networks. The UBG solution guarantees a near-optimal detection time while pruning at least 80% of the Monte-Carlo calls of CELF. The novel accelerations of UBG can significantly reduce the selection time and further achieve a speedup of at least 10² times. There are several interesting future directions. First, since the detection time D(S) is #P-hard to calculate exactly, it remains an open question how to design an efficient algorithm to estimate D(S) with a theoretical guarantee. Second, the infection probability ip and the prior distribution Π are predefined in this paper; how to learn these parameters for the DSI model from available cascade data remains unexplored.

1:	initialize S = ∅
2:	for i = 1 to k do
3:		select u = argmax_{w∈V∖S} (R(S ∪ {w}) − R(S))
4:		S = S ∪ {u}
5:	end for
6:	output S

01:	Input: the infection probability matrix IP of G = (V, E), a budget k, a prior distribution Π
02:	Output: the sensor set S with k nodes
03:	initialize S ⟵ ∅, R(S) ⟵ 0, and δ ⟵ T_max · Π · (E + ((T_max − 1)/2) · IP)
04:	for i = 1 to k do
05:		set I(v) ⟵ 0 for v ∈ V∖S
06:		while TRUE do
07:		{	u ⟵ argmax_{v∈V∖S} δ_v
08:			if I(u) = 0
09:				δ_u ⟵ MC(S ∪ {u}) − R(S)
10:				I(u) ⟵ 1
11:			end if
12:			if δ_u ≥ max_{v∈V∖(S∪{u})} δ_v
13:				R(S ∪ {u}) ⟵ R(S) + δ_u
14:				S ⟵ S ∪ {u}
15:				break
16:			end if }
17:	end for
18:	output S

01–02, 04–08, 10–18: the same as in Algorithm 2
03: initialize S ⟵ ∅, R(S) ⟵ 0, the distance d on G, and δ ⟵ T_max · Π · (E + ((T_max − 1)/2) · IP)
09: δ_u ⟵ ∑_{v∈V} π(v) [T_max − d(v, S ∪ {u}) ∧ T_max] − R(S)

01–08, 10–12, 14–18: the same as in Algorithm 2
09: δ_u ⟵ the right-hand side of Eq. (32)
13: this row is removed.
