
Temporal effects in trend prediction: identifying the most popular nodes in the future.

Abstract

Prediction is an important problem in many scientific domains. In this paper, we focus on trend prediction in complex networks, i.e. identifying the most popular nodes in the future. Owing to the preferential attachment mechanism in real systems, nodes' recent degree and cumulative degree have been successfully used to design trend prediction methods. Here we take into account more detailed information about the network evolution and propose a temporal-based predictor (TBP). TBP predicts the future trend from the node strength in a weighted network in which each link weight is given by its exponential aging. Three data sets with time information are used to test the performance of the new method. We find that TBP has high overall accuracy in predicting the most popular nodes in the future. More importantly, it can identify many potential objects with low popularity in the past but high popularity in the future. The effect of the decay speed in the exponential aging on the results is discussed in detail.


Year: 2015 PMID: 25806810 PMCID: PMC4373959 DOI: 10.1371/journal.pone.0120735

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

The emergence of online social media and rich user-generated content has brought about an information overload problem. Online content has become increasingly abundant and immediately available, and users cannot go through every piece of information to find the high-quality items [1]. These high-quality sources generally have formidable power to influence opinions, culture, and policy, as well as advertising profit. They therefore usually attract a lot of attention and eventually become popular. The rapid development of the Internet has made available a huge amount of data with time information, making it possible to study the popularity dynamics of online content. The popularity of various kinds of content on the Web, such as news [2], Twitter [3], blog posts [4], videos [5, 6], posts in online discussion forums [7] and product reviews [8], has been found to vary significantly over time. In this context, the early identification of content that will eventually become popular is an important problem [9]. It can not only improve the user experience, since the prediction saves users the time spent searching for high-quality content among unpopular items, but also bring commercial profit to online vendors, since it helps them better manage their inventory.

Although preferential attachment (PA) successfully explains the power-law distributions widely found in real systems [10], it does not perform well enough when applied to predicting the future popularity (or degree) of nodes. For example, in citation networks some papers attract significantly more citations than PA predicts [11, 12]. The ability of a node to attract new links has been found to decay exponentially with time, both in citation networks [13] and in information access [14]. Moreover, temporal analysis of the popularity dynamics of online content in Wikipedia [15] and microblogs [16] reveals burst patterns. An extensive study of how content popularity grows and fades over time in online media is presented in ref. [17]. Besides these empirical studies of temporal dynamics, some possible mechanisms behind the empirical findings have been proposed, such as relevance and time decay [13], random popularity shifts [15] and human dynamics [5]. All these studies show that a high cumulative degree does not guarantee a large degree increase in the future.

In this paper, we focus on trend prediction in complex networks, i.e. identifying the most popular nodes in the future. The literature contains some existing methods for this problem, mostly tailored to specific application fields [18-21]. For instance, on the popular online service Digg.com, the initial popularity growth of content has been used to predict its later popularity [19]. A popularity-based predictor has been designed for trend prediction, and its performance was shown to improve further when the user social network is incorporated [20]. Here we introduce a temporal-based predictor (TBP) that takes advantage of the time decay effect found in many empirical works. The method is validated on three time-stamped data sets. The results show that its prediction precision is remarkably higher than that of PA. In addition, the new method is especially effective in identifying potential nodes with low popularity in the past but high popularity in the future.

Materials and Methods

The system we consider in this paper can be modeled as a bipartite network consisting of a set of users U and a set of objects O. We use Latin letters for users and Greek letters for objects to distinguish them. A bipartite network can be represented by an adjacency matrix A, whose element $A_{i\alpha}$ equals 1 if user i has collected object α and 0 otherwise. We consider snapshots of the network at different time stamps by taking into account only the links established before a given time t, and we use A(t) to denote the adjacency matrix at time t. The number of objects collected by user i and the number of users who have collected object α at time t (i.e., the user degree and the object degree) are computed as $k_i(t) = \sum_{\alpha} A_{i\alpha}(t)$ and $k_{\alpha}(t) = \sum_{i} A_{i\alpha}(t)$, respectively. The popularity increase of object α in the future T time steps (i.e. the future time window) is then

$$\Delta k_{\alpha}(t, T) = k_{\alpha}(t + T) - k_{\alpha}(t).$$

For a suitably chosen value of T, this quantity measures the temporal interest in object α.

The main goal of trend prediction in this paper is to identify the most popular objects in the future. To this end, we define a testing time t and a future time window of length T, and rank all objects according to their popularity increase $\Delta k_{\alpha}(t, T)$. This ranking is considered the true ranking of popularity in the future. A generic predictor makes use of the information available before t and assigns a prediction score $s_{\alpha}$ to every object. These scores are then mapped into a predicted ranking. In general, the higher the overlap between the predicted ranking and the true ranking, the better the predictor.
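As a rough illustration of these definitions, the following Python sketch computes the object degree, the popularity increase over a future window, and the resulting true ranking from a list of time-stamped links. The toy data, the function names, the inclusive boundary at t (links with day <= t are counted in the snapshot) and the tie-breaking rule are all assumptions made for the example, not details taken from the paper.

```python
from collections import Counter

# Hypothetical toy data: (user, object, day) triples.
links = [
    ("u1", "a", 1), ("u2", "a", 2), ("u3", "a", 9),
    ("u1", "b", 3), ("u2", "b", 8), ("u3", "b", 9), ("u4", "b", 10),
    ("u1", "c", 4),
]

def object_degree(links, t):
    """k_alpha(t): number of users who collected each object up to day t (inclusive, an assumption)."""
    return Counter(obj for _, obj, day in links if day <= t)

def popularity_increase(links, t, T):
    """Delta k_alpha(t, T) = k_alpha(t + T) - k_alpha(t) for every object seen by day t + T."""
    before = object_degree(links, t)
    after = object_degree(links, t + T)
    # A Counter returns 0 for objects that have no links yet at time t.
    return {obj: after[obj] - before[obj] for obj in after}

def ranking(scores):
    """Objects sorted by decreasing score; ties are broken by object id (an assumption)."""
    return sorted(scores, key=lambda obj: (-scores[obj], obj))

true_increase = popularity_increase(links, t=5, T=5)
true_ranking = ranking(true_increase)
print(true_increase)
print(true_ranking)
```

Any predictor can then be compared against `true_ranking` by passing its own score dictionary to `ranking`.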

Popularity-based Predictor

Preferential attachment is a well-known mechanism of network evolution which assumes that the probability of a node attracting a new link is proportional to its cumulative degree. In trend prediction, this means that objects which are popular at time t are expected to have better chances of attracting new links from users. This implies that the cumulative degree of an object, $k_{\alpha}(t)$, should be a good predictor of its future popularity increase. Considering the decaying interest in objects, the prediction scores can instead be set to the recent popularity of objects. The prediction score of an object at time t is then $\Delta k_{\alpha}(t - T_P, T_P) = k_{\alpha}(t) - k_{\alpha}(t - T_P)$, where $T_P$ is the length of the considered history. Recently, a popularity-based predictor (PBP) [20] has been proposed to combine the predictors $k_{\alpha}(t)$ and $\Delta k_{\alpha}(t - T_P, T_P)$. PBP has a tunable parameter λ ∈ [0, 1] that makes the predictor change smoothly from $k_{\alpha}(t)$ to $\Delta k_{\alpha}(t - T_P, T_P)$. Mathematically, the prediction score of PBP is computed as

$$s_{\alpha}(t) = k_{\alpha}(t) - \lambda\, k_{\alpha}(t - T_P).$$

This predictor reduces to the total popularity method when λ = 0 and to the recent popularity method when λ = 1.
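The interpolation between total and recent popularity can be sketched in a few lines of Python. The sketch assumes that the PBP score has the linear form s = k(t) - λ·k(t - T_P), which is the form consistent with the two limits described above (λ = 0 and λ = 1). The toy data, the names `pbp_scores` and `T_P`, and the inclusive time boundary are illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy data: (user, object, day) triples.
links = [
    ("u1", "a", 1), ("u2", "a", 2), ("u3", "a", 9),
    ("u1", "b", 3), ("u2", "b", 8), ("u3", "b", 9), ("u4", "b", 10),
    ("u1", "c", 4),
]

def object_degree(links, t):
    """Cumulative degree of each object, counting links with day <= t."""
    return Counter(obj for _, obj, day in links if day <= t)

def pbp_scores(links, t, T_P, lam):
    """Assumed PBP form: s_alpha = k_alpha(t) - lam * k_alpha(t - T_P)."""
    k_now = object_degree(links, t)
    k_old = object_degree(links, t - T_P)
    return {obj: k_now[obj] - lam * k_old[obj] for obj in k_now}

print(pbp_scores(links, t=10, T_P=5, lam=0))    # total popularity k(t)
print(pbp_scores(links, t=10, T_P=5, lam=1))    # recent popularity k(t) - k(t - T_P)
print(pbp_scores(links, t=10, T_P=5, lam=0.5))  # something in between
```

With λ close to 1, links older than t - T_P contribute almost nothing to the score, while links inside the window always count fully.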

Temporal-based Predictor

The popularity-based predictor in effect counts all the recent popularity of an object in full, but weakens the influence of the links the object received before $t - T_P$. In the literature, it has been found that the interest in individual objects vanishes exponentially with time in some cases [13, 22]. It is therefore rather arbitrary to simply divide time into two segments, as the popularity-based predictor does, since this may lose the detailed temporal information of real networks. To make better use of the temporal information for trend prediction, we propose a temporal-based predictor (TBP) in this paper. In TBP, we consider the mechanism that the influence of a link decays exponentially with time. An aging function is accordingly introduced to calculate the prediction scores:

$$s_{\alpha}(t) = \sum_{i} A_{i\alpha}(t)\, e^{-\gamma\,(t - T_{i\alpha})},$$

where $T_{i\alpha}$ denotes the time at which user i selected object α, and γ ≥ 0 is a parameter which controls the decay speed. A larger γ indicates a faster decay, and γ = 0 corresponds to the cumulative popularity without any decay. TBP preserves all the detailed temporal information in the network. By adjusting γ, we can study the temporal effects in the prediction of future popularity.
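A minimal Python sketch of the exponentially aged score is given below, assuming the score of an object is the sum of exp(-γ·(t - T_iα)) over all links it received up to time t. The toy data and the function name `tbp_scores` are hypothetical; time is counted in days, as in the data sets used here.

```python
import math

# Hypothetical toy data: (user, object, day) triples.
# "old" collected five links early on, "new" collected three links recently.
links = (
    [("u%d" % d, "old", d) for d in (1, 2, 3, 4, 5)]
    + [("u%d" % d, "new", d) for d in (8, 9, 10)]
)

def tbp_scores(links, t, gamma):
    """s_alpha(t) = sum over links to alpha with day <= t of exp(-gamma * (t - day))."""
    scores = {}
    for _, obj, day in links:
        if day <= t:
            scores[obj] = scores.get(obj, 0.0) + math.exp(-gamma * (t - day))
    return scores

print(tbp_scores(links, t=10, gamma=0.0))  # reduces to the cumulative degree
print(tbp_scores(links, t=10, gamma=0.5))  # recent links dominate the score
```

In this toy case the object with fewer but more recent links overtakes the older one as soon as the decay is switched on, which is the effect the parameter γ is meant to control.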

Data Description

To test the performance of TBP, we use three distinct real data sets: MovieLens, Netflix and Facebook. The MovieLens and Netflix data sets contain movie ratings, and the Facebook data set contains users’ wall post relationships.

MovieLens is provided by the GroupLens project at the University of Minnesota (www.grouplens.org). We use their 10 million ratings data set, in which each user has at least 20 ratings. Netflix is a huge data set released by the DVD rental company Netflix for its Netflix Prize (www.netflixprize.com). The original data has 480189 users, 17770 objects and 100480507 ratings. Since the original MovieLens and Netflix data sets are large, we extracted a small subset from each of them by randomly choosing some users who have rated at least 20 movies and taking all the movies they had rated. For both MovieLens and Netflix, the ratings are given on an integer scale from 1 to 5 (from worst to best). Here we only consider ratings higher than 2 as links. The final data consists of 5000 users, 7533 movies, and 864581 links in MovieLens and 4960 users, 16599 movies, and 1249058 links in Netflix.

The Facebook data set contains a list of all the wall posts from the Facebook New Orleans network [23-25]. A link from one user to another indicates that the first user posted on the second user’s wall. As the Facebook network is a unipartite directed network, we mapped it to a bipartite network with a set of users and a set of users’ walls (objects). If a user has posted on a wall, there is a link between the user and the wall. The original data has 42390 users, 39986 objects and 876993 links. Since a user may have written on his or her own wall, we removed these links to eliminate self-influence. The final data consists of 40981 users, 38143 objects and 855542 links.

For all three data sets, time is counted in days. The characteristics of the data sets are summarized in Table 1. All three data sets are available from the Koblenz Network Collection [26], and they are all free to use even for commercial purposes (the data sets we used in this paper are free to download as S1 Dataset).
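The two filtering steps described above (keeping only ratings higher than 2 as links, and dropping posts that users wrote on their own walls) could be sketched as follows. The record layouts and function names are assumptions for illustration only and do not reflect the actual file formats of the data sets.

```python
# Hypothetical record layouts:
#   ratings: (user, movie, rating, day)
#   posts:   (author, wall_owner, day)

def ratings_to_links(ratings, min_rating=3):
    """Keep a (user, movie, day) link only for ratings higher than 2, i.e. at least 3."""
    return [(user, movie, day) for user, movie, rating, day in ratings
            if rating >= min_rating]

def drop_self_posts(posts):
    """Remove posts that users wrote on their own walls."""
    return [(author, wall, day) for author, wall, day in posts if author != wall]

ratings = [("u1", "m1", 5, 10), ("u1", "m2", 2, 11),
           ("u2", "m1", 3, 12), ("u2", "m3", 1, 13)]
posts = [("u1", "u2", 1), ("u1", "u1", 2), ("u3", "u1", 3)]

movie_links = ratings_to_links(ratings)
wall_links = drop_self_posts(posts)
print(movie_links)
print(wall_links)
```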

Table 1

Basic statistical features of the data sets.

Data set	Users	Objects	Links	Period
MovieLens	5000	7533	8.6 × 10⁵	1st Jan 2002 to 1st Jan 2005
Netflix	4960	16599	1.2 × 10⁶	1st Jan 2000 to 31st Dec 2005
Facebook	40981	38143	8.6 × 10⁵	14th Sep 2004 to 22nd Jan 2009

Evaluation Metrics

We apply three metrics to quantify the predictors’ performance: AUC, precision and novelty. As the main point of this paper is to predict which objects will be popular in the future, only the top part of the ranking list should be considered when evaluating the predictors.

We thus use a standard measure from the information filtering literature named AUC [27], which evaluates a ranking list through the relative position of its top n objects. We select the top n objects in the real future as a group of benchmark objects and denote it as set B. The other objects form the complement of B, denoted as B′. The AUC is then calculated as the fraction of pairs (α, β), with α ∈ B and β ∈ B′, in which the predictor ranks α above β:

$$\mathrm{AUC} = \frac{1}{|B|\,|B'|} \sum_{\alpha \in B} \sum_{\beta \in B'} I(s_{\alpha} > s_{\beta}),$$

where $I(\cdot)$ equals 1 if its argument is true and 0 otherwise. AUC equals one when all benchmark objects are ranked higher than the other objects, while AUC = 0.5 corresponds to a completely random ranking list.

Another evaluation metric is precision. It is defined as the fraction of objects in the top n places of the predicted ranking that also appear in the top n places of the true ranking [28]: $P_n = D_n / n$, where $D_n$ is the number of common objects in the top n places of the predicted ranking and the true ranking. It lies in the range [0, 1], the higher the better.

It is often the case that objects popular in the future time window (t, t + T] were already popular in the past. Successful prediction of those objects contributes to the precision $P_n$. However, predicting these objects brings much less benefit to users than predicting genuinely “new entries”, i.e. objects that were missing from the top n in the past but appear there in the future time window. We label the true number of those objects as $E_n$ and the number of those successfully identified by the predicted ranking as $C_n$. The rate of correct prediction of these new entries is $Q_n = C_n / E_n$. This allows us to measure how well a predictor is able to identify potential objects. We call $Q_n$ novelty. For a fixed n, we write P and Q for short.
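The three metrics can be sketched in Python as below. This is only an illustration under stated assumptions: the AUC is computed as the fraction of (benchmark, non-benchmark) pairs ordered correctly, with ties counted as one half, and top-n sets break ties by object id. Neither choice is specified in the text, and the toy score dictionaries are hypothetical.

```python
def top_n(scores, n):
    """Set of the n highest-scoring objects (ties broken by object id, an assumption)."""
    return set(sorted(scores, key=lambda obj: (-scores[obj], obj))[:n])

def auc(pred, benchmark):
    """Fraction of pairs (alpha in B, beta in B') with s_alpha > s_beta; ties count 0.5 (assumption)."""
    others = [obj for obj in pred if obj not in benchmark]
    total = 0.0
    for a in benchmark:
        for b in others:
            if pred[a] > pred[b]:
                total += 1.0
            elif pred[a] == pred[b]:
                total += 0.5
    return total / (len(benchmark) * len(others))

def precision(pred, true, n):
    """P_n = D_n / n: overlap of the predicted and the true top n."""
    return len(top_n(pred, n) & top_n(true, n)) / n

def novelty(pred, true, past, n):
    """Q_n = C_n / E_n: share of new entries in the true top n that the predictor finds."""
    new_entries = top_n(true, n) - top_n(past, n)
    if not new_entries:
        return None  # undefined when there are no new entries
    return len(new_entries & top_n(pred, n)) / len(new_entries)

past = {"a": 10, "b": 8, "c": 2, "d": 1}         # cumulative degree at time t
true = {"a": 5, "b": 1, "c": 6, "d": 0}          # popularity increase in (t, t + T]
pred = {"a": 4.0, "b": 1.5, "c": 3.0, "d": 0.2}  # scores from some predictor

print(auc(past, top_n(true, 2)), precision(past, true, 2), novelty(past, true, past, 2))
print(auc(pred, top_n(true, 2)), precision(pred, true, 2), novelty(pred, true, past, 2))
```

Using the past degree itself as the predictor gives a novelty of zero by construction, since such a predictor can never promote an object that was outside the past top n.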

Results

To obtain the final evaluation of the predictors’ performance, we average the results over 10 randomly selected testing times t for each data set. To make sure that there is enough history information, t is set to be at least one year later than the first record in each data set. As all the predictors considered in this paper are based on objects’ history, we only consider objects with at least one link before the testing time t. Fig. 1 shows the performance of TBP under different γ for the MovieLens, Netflix and Facebook data sets. T is set to 30 days for all three data sets. Results for different values of n are given for all metrics, and they show that the influence of γ does not depend on n. When γ = 0 (equivalent to the cumulative degree predictor), AUC and P are relatively small and Q = 0 for all data sets. This indicates that PA has little efficacy in predicting future popularity. A small γ (a relatively slow time decay) significantly increases the prediction performance, especially for Q. The high value of Q indicates that the temporal-based predictor has great power to identify “new objects” which are not yet popular. Too large a γ decreases the performance, so the best performance of TBP on each data set is obtained by tuning γ. We denote by γ* the parameter value resulting in the highest P. It is clear that the performance of TBP with parameter γ* is remarkably higher than that of PA (i.e. γ = 0 in TBP).
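The selection of γ* can be pictured as a simple parameter sweep: compute the TBP scores for each candidate γ, measure the precision against the true future ranking, and keep the best value. The Python sketch below does this for a single testing time on hypothetical toy data; the candidate grid, the function names and the tie-breaking rules are assumptions, and the averaging over 10 testing times described above is omitted for brevity.

```python
import math

# Hypothetical toy data: (user, object, day) triples; days after t = 10 are "future".
links = (
    [("u%d" % d, "old", d) for d in (1, 2, 3, 4, 5)]
    + [("u%d" % d, "mid", d) for d in (2, 4, 6, 7, 12)]
    + [("u%d" % d, "new", d) for d in (8, 9, 10, 11, 12, 13)]
)

def tbp_scores(links, t, gamma):
    """Exponentially aged popularity using only links with day <= t."""
    scores = {}
    for _, obj, day in links:
        if day <= t:
            scores[obj] = scores.get(obj, 0.0) + math.exp(-gamma * (t - day))
    return scores

def future_increase(links, t, T):
    """Links received in (t, t + T] by objects that already had a link by time t."""
    increase = {obj: 0 for _, obj, day in links if day <= t}
    for _, obj, day in links:
        if obj in increase and t < day <= t + T:
            increase[obj] += 1
    return increase

def top_n(scores, n):
    return set(sorted(scores, key=lambda obj: (-scores[obj], obj))[:n])

def precision(pred, true, n):
    return len(top_n(pred, n) & top_n(true, n)) / n

def best_gamma(links, t, T, n, gammas):
    """Return the first gamma in the grid that maximises the precision."""
    true = future_increase(links, t, T)
    return max(gammas, key=lambda g: precision(tbp_scores(links, t, g), true, n))

gamma_star = best_gamma(links, t=10, T=5, n=1, gammas=[0.0, 0.5, 1.0])
print(gamma_star)
```

On real data one would repeat the sweep for several testing times and average the precision before taking the maximum.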

Fig 1

Prediction results of TBP for the MovieLens, Netflix and Facebook data sets under different γ.


The performance for different values of n is given: n = 50 is shown by black lines with squares, n = 100 by red lines with circles, and n = 200 by blue lines with triangles.

Table 2 shows the performance of TBP and PBP for the three data sets. The parameter for each predictor is the one corresponding to the highest P value. The results show that TBP performs better on most evaluation metrics. The best λ value for PBP is 0.98 for both the MovieLens and Netflix data sets and 0.93 for the Facebook data set, which indicates that the links an object received a long time ago have only a small influence on its future popularity. Instead of arbitrarily dividing time into two segments as PBP does, TBP uses an exponential decay function, which can further improve the prediction performance.

Table 2

The prediction performance of the temporal-based predictor (TBP) and the popularity-based predictor (PBP) for the MovieLens, Netflix and Facebook data sets.

The parameter for each predictor is set as the one corresponding to the highest P value. n is set as 100.

Data set	Predictor	Parameter	AUC	P_n	Q_n
MovieLens	PBP	0.98	0.988	0.706	0.432
MovieLens	TBP	0.06	0.990	0.705	0.526
Netflix	PBP	0.98	0.960	0.634	0.502
Netflix	TBP	0.06	0.962	0.637	0.545
Facebook	PBP	0.93	0.893	0.372	0.217
Facebook	TBP	0.03	0.894	0.387	0.252


We define a rank change value for each object as $dr = r_{\mathrm{real}} - r_{\mathrm{pred}}$, where $r_{\mathrm{real}}$ is the real rank in the near future and $r_{\mathrm{pred}}$ is the rank given by a predictor. dr = 0 indicates that the predicted rank equals the real rank in the near future, i.e. the predictor performs perfectly; dr < 0 indicates that the predictor underestimates the object’s popularity; dr > 0 indicates that the predictor overrates the object’s popularity. To test how different predictors influence the rank of the top objects in the future, we plot in Fig. 2 the correlation of dr with these objects’ degree rank r at the testing time under different predictors and parameters. This figure only considers the top 100 objects in the future time window T. The parameter λ for PBP is the one corresponding to the highest P value. For TBP, both γ* and γ = 1 are shown in Fig. 2. These objects are ranked in the top 100 positions in the near future, but their past popularity has a broad distribution. Both TBP and PBP with their best parameters reduce the absolute value of dr, which gives these predictors a better performance in predicting future popularity. Compared with PBP, TBP is better at improving the rank of objects that have a high r (low past popularity) but are ranked in top places in the future. When γ = 1 in TBP, the absolute value of the rank difference dr of the objects with high r clearly becomes smaller than in the case of γ*, but the absolute value of dr of the objects with lower r becomes larger. This may be because the parameter γ gives lower scores to the objects that were popular in the past, and thereby improves the rank of the objects that were not popular in the past. The higher the parameter γ is, the more pronounced this effect.
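The sign convention of the rank change can be checked with a small Python sketch. It assumes dr is the real future rank minus the predicted rank, which is the ordering consistent with the statement that dr < 0 means the predictor underestimates an object; the toy scores and function names are hypothetical.

```python
def rank_positions(scores):
    """Map each object to its rank (1 = highest score; ties broken by object id, an assumption)."""
    ordered = sorted(scores, key=lambda obj: (-scores[obj], obj))
    return {obj: pos for pos, obj in enumerate(ordered, start=1)}

def rank_change(true_scores, pred_scores):
    """dr = (real future rank) - (predicted rank) for objects present in both rankings."""
    r_real = rank_positions(true_scores)
    r_pred = rank_positions(pred_scores)
    return {obj: r_real[obj] - r_pred[obj] for obj in r_real if obj in r_pred}

true = {"a": 5, "b": 1, "c": 6, "d": 0}   # popularity increase in the future window
pred = {"a": 10, "b": 8, "c": 2, "d": 1}  # e.g. cumulative degree used as the predictor

dr = rank_change(true, pred)
print(dr)
```

Here object "c" is first in the real future but only third in the predicted ranking, so its dr is negative: the predictor underestimates it.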

Fig 2

The correlation of dr with r for top 100 objects in the real future.


Black lines with circles show the result of TBP with the best parameter γ*, red lines with triangles show the result of PBP with the best parameter λ*, and blue lines with diamonds show the result of TBP with γ = 1. T is set to 30 days for all data sets.

A small T corresponds to predicting objects’ popularity in the short term, while a large T requires predicting the trend in the long term. We therefore test the performance of the predictors under different T. Fig. 3 shows the performance of TBP and PBP as a function of the future time window T. For each predictor and each value of T, the parameter corresponding to the highest P value is selected. For PBP, the history length $T_P$ is set equal to T. Compared with PBP, TBP clearly has a better prediction performance for all values of T. For all data sets, the precision P, novelty Q and AUC of the predictors increase substantially with T when T is very small. This is because there is a lot of noise when T is too small. However, the precision decreases with T as T becomes larger. This may be because the predicted popularity becomes outdated for larger T.

Fig 3

The performance of TBP and PBP as a function of the future time window T.

TBP is shown by black lines, and PBP by red lines. n is set to 100. Time is measured in days.


As noted above, γ controls the decay speed of the influence of old links. For different lengths T of the prediction window, the best parameter γ* may be different. Fig. 4 shows the relationship between γ* and T. One can see that the larger T is, the smaller γ* becomes. This means that for predictions over a shorter T, objects’ recent popularity matters more, while for predictions over a longer T, a longer history of popularity should be considered.

Fig 4

The relationship of γ* with T for the MovieLens, Netflix and Facebook data sets.

Discussion

To summarize, we proposed a temporal-based predictor (TBP) and studied its performance in trend prediction. The basic idea of TBP is to introduce an exponential time decay when predicting objects’ future popularity. We used three metrics to evaluate the predictor’s performance: P, Q and AUC. We found that the parameter γ, which controls the speed of the time decay, gives lower scores to objects that were popular in the past and accordingly improves the rank of objects that were not popular in the past. The higher γ is, the more pronounced this effect. Thus, TBP has a higher ability to detect “new entries” that have a lower cumulative popularity but a higher future popularity, and to promote these objects to the front of the predicted ranking list. Compared with PBP, TBP is better at detecting the objects that will be popular in the future across different lengths T of the future time window.

Ranking is one of the most important and fundamental methods for addressing the information overload problem. The study of the popularity dynamics of online information gives us some inspiration for solving the trend prediction problem, which is a very practical issue. In this paper, we studied the temporal effects of links on objects’ future popularity. This study is based on the empirical finding that the ability of a node to attract new links vanishes exponentially with time. Besides time, many other factors can influence the dynamics of popularity, such as human dynamics, link heterogeneity, and external influence. Accounting for these factors may further improve the performance of trend predictors. In our future studies, we will focus on improving the performance of trend predictors with the help of both empirical observations and theoretical analysis.

S1 Dataset. The data sets we used in this paper (RAR).

8 in total


1. Emergence of scaling in random networks

2. Effect of aging on network structure.

3. Temporal effects in the growth of networks.

4. Dynamics of information access on the web.

5. Quantifying long-term scientific impact.

6. Robust dynamic classes revealed by measuring the response function of a social system.

7. Characterizing and modeling the dynamics of online popularity.

8. The meaning and use of the area under a receiver operating characteristic (ROC) curve.