Rogini Runghen, Daniel B Stouffer, Giulio V Dalla Riva.
Abstract
Networks are increasingly used across fields to represent systems, with the aim of understanding the rules governing observed interactions and hence predicting how a system is likely to behave in the future. Recent developments in network science highlight that accounting for node metadata improves both our understanding of how nodes interact with one another and the accuracy of link prediction. However, to predict interactions in a network within existing statistical and machine learning frameworks, we need to learn objects whose dimension grows rapidly with the number of nodes, so the task quickly becomes computationally and conceptually challenging. Here, we present a new predictive procedure that combines a statistical, low-rank graph embedding method with machine learning techniques, substantially reducing the complexity of the learning task and allowing us to efficiently predict interactions from node metadata in bipartite networks. To illustrate its application to real-world data, we apply it to a large dataset of tourist visits across a country. We find that our procedure accurately reconstructs existing interactions and predicts new interactions in the network. Overall, from both a network science and a data science perspective, our work offers a flexible and generalizable procedure for link prediction.
Keywords: Random Dot Product Graphs; graph embedding; link prediction; machine learning; metadata; predictive models
Year: 2022 PMID: 36016910 PMCID: PMC9399714 DOI: 10.1098/rsos.220079
Source DB: PubMed Journal: R Soc Open Sci ISSN: 2054-5703 Impact factor: 3.653
Figure 1. Using a Random Dot Product Graphs framework to predict interactions in a bipartite network. We illustrate the framework with our case study, the travelling patterns of visitors to tourist destinations, treated as network data. In (1), the bipartite representation of visitors' travelling patterns: nodes are of two types (circles represent visitors, squares represent places) and links indicate a trip by a given visitor to a given place; solid lines represent observed links. In (2), we estimate the position of nodes within the observed bipartite network using a singular value decomposition (SVD) of the adjacency matrix representing the visitor–place interactions. This yields two latent feature spaces: a visitor latent feature space and a place latent feature space. Here we show the embeddings of visitors and places, respectively, in a latent feature space of dimension d = 2, with axes LD1 and LD2. In (3), we relate the node metadata directly to the latent feature spaces, using machine learning techniques to find the relationship between the two. This step allows us to reconstruct interactions between observed nodes (visitors and places, respectively). To predict new interactions, the same fitted models are applied to the observed metadata of new visitors and new places, projecting each new node into the corresponding latent feature space (LD1, LD2). Finally, in (4), the dot product of latent positions gives the predicted probability of interaction between a new visitor and a new place added to the visitation network. Dashed lines represent newly predicted links for observed and new nodes.
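The embedding and prediction steps in the caption above can be sketched in a few lines of numpy. This is a minimal illustration of the general technique (truncated SVD embedding plus dot-product reconstruction), not the authors' code; the toy adjacency matrix and dimension d = 2 are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy visitor-place adjacency matrix (rows: visitors, cols: places),
# standing in for the observed visitation network of step (1).
A = (rng.random((50, 12)) < 0.3).astype(float)

# Step (2): truncated SVD gives low-rank latent positions for both node types.
d = 2  # embedding dimension (chosen via the scree plot, as in figure 2)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
visitors = U[:, :d] * np.sqrt(s[:d])   # visitor latent feature space
places = Vt[:d, :].T * np.sqrt(s[:d])  # place latent feature space

# Step (4): the dot product of latent positions approximates the
# strength/probability of interaction between each visitor and place.
A_hat = visitors @ places.T
print(A_hat.shape)  # (50, 12)
```

Step (3), mapping node metadata onto these latent positions, is a separate regression problem; figures 3 and 4 compare models for it.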
Summary of node metadata used.
| node type | metadata | data type | classes |
|---|---|---|---|
| visitor | gender | categorical | male, female |
| visitor | age | categorical | age group: <20, 21–25, 26–34, 35–39, 40–44, 45–49, 50–51, 64–69, >70 |
| visitor | activity type | categorical | hiking, sightseeing, water activities, museums and other heritage sites, visiting family, work purposes |
| visitor | mode of transportation | categorical | car, van, boat, tour bus, bus, helicopter, aeroplane |
| place | place geolocation | continuous | latitude and longitude of locations |
| place | place type | categorical | heritage site, crown protected area, town, village, recreational site |
| place | regional council | categorical | Northland, Auckland, Waikato, Bay of Plenty, Gisborne, Hawke’s Bay, Taranaki, Manawatu-Wanganui, Wellington, Tasman/Nelson, Marlborough, West Coast, Canterbury, Otago, Southland, and areas outside regional council |
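Before the categorical metadata above can be regressed onto a latent feature space, it must be encoded numerically; continuous metadata such as latitude and longitude can be used as-is. The sketch below shows one common choice, one-hot encoding, on hypothetical place-type values; the paper does not specify its exact encoding, so this is an assumption.

```python
import numpy as np

# Hypothetical place-type metadata for five places.
place_types = ["town", "village", "town", "heritage site", "recreational site"]

# One-hot encode: one binary column per observed class.
classes = sorted(set(place_types))
one_hot = np.array([[t == c for c in classes] for t in place_types], float)
print(one_hot.shape)  # (5, 4): five places, four distinct classes
```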
Summary of number of visitors and places used for the different steps of the predictive procedure.
| dataset | analysis | no. visitors | no. places | visitor–place interactions |
|---|---|---|---|---|
| training set (70% of full dataset) | SVD | 101 656 | 430 | 421 784 625 |
| model training set (50% of training set) | neural network | 71 159 | 301 | 210 892 312 |
| model test set (50% of training set) | neural network | 30 497 | 129 | 210 892 313 |
| validation set (30% of full dataset) | neural network, dot product | 88 286 (43 662) | 360 (120) | 140 594 875 |
Figure 2. Identifying an adequate dimension d for the network data. (a) Scree plot of the singular values of the adjacency matrix of the visitation network, in decreasing order. The x-axis shows the singular value index and the y-axis the singular values. (b) Cumulative plot showing the percentage of variability explained as the singular value index increases. The x-axis again shows the singular value index and the y-axis the per cent of variance explained. Using Zhu & Ghodsi's [33] profile-likelihood criterion, we picked d = 6, as indicated by the red dotted line. This dimension explains 70% of the variability of the visitation network data.
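The Zhu & Ghodsi criterion used in figure 2 chooses the elbow of the scree plot by splitting the sorted singular values into two groups, modelling each group as Gaussian with a shared variance, and picking the split that maximizes the profile log-likelihood. A minimal numpy sketch of that idea (the toy spectrum is invented for illustration):

```python
import numpy as np

def profile_likelihood_elbow(svals):
    """Elbow selection via the Zhu & Ghodsi profile-likelihood criterion:
    for each split point q, fit two Gaussian groups with a pooled variance
    and keep the q that maximizes the log-likelihood."""
    s = np.sort(np.asarray(svals, float))[::-1]
    n = len(s)
    best_q, best_ll = 1, -np.inf
    for q in range(1, n):
        g1, g2 = s[:q], s[q:]
        ss = np.sum((g1 - g1.mean()) ** 2) + np.sum((g2 - g2.mean()) ** 2)
        var = max(ss / n, 1e-12)  # pooled variance, guarded against zero
        ll = -0.5 * n * np.log(2 * np.pi * var) - ss / (2 * var)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q

# A spectrum with a clear gap after the first three singular values:
svals = [50.0, 42.0, 39.0, 5.0, 4.5, 4.2, 4.0, 3.8]
print(profile_likelihood_elbow(svals))  # 3
```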
Figure 3. Training of regression models over time when projecting observed visitor metadata onto the latent feature space, using an adaptive moment estimation (Adam) optimizer run for 30 epochs with a batch size of 20. The plot shows validation performance, i.e. performance on the subset of the training visitation dataset held out to validate the three models while searching for the best mapping from node metadata to the latent feature space. The x-axis indicates the epoch. The y-axis indicates the mean absolute error (MAE), the cost function used to measure the accuracy of model predictions, i.e. the distance between the estimated latent feature space (SVD) and the predicted latent feature space. The red line shows the learning curve of the linear regression model (baseline), the green line that of the multilayer perceptron model (MLP), and the blue line that of the neural network with two hidden layers (NN). Both the MLP and NN models performed better than the baseline model.
Figure 4. Training of regression models over time when projecting observed place metadata onto the latent feature space, using an adaptive moment estimation (Adam) optimizer run for 30 epochs with a batch size of 20. The plot shows validation performance, i.e. performance on the subset of the training visitation dataset held out to validate the three models while searching for the best mapping from node metadata to the latent feature space. The x-axis indicates the epoch. The y-axis indicates the mean absolute error (MAE), the cost function used to measure the accuracy of model predictions, i.e. the distance between the estimated latent feature space (SVD) and the predicted latent feature space. The red line shows the learning curve of the linear regression model (baseline), the green line that of the multilayer perceptron model (MLP), and the blue line that of the neural network with two hidden layers (NN). Here the MLP model appears to perform better than both the baseline and NN models.
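The training setup described in figures 3 and 4 (Adam optimizer, MAE loss, 30 epochs, batch size 20) can be sketched for the baseline linear model in plain numpy. This is an illustrative stand-in, not the authors' implementation; the one-hot metadata matrix X, the synthetic latent targets Y, and the learning-rate value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: one-hot metadata X mapped to d-dimensional latent targets Y
# (Y stands in for the SVD-estimated latent positions).
n, p, d = 200, 6, 2
X = np.eye(p)[rng.integers(0, p, n)]
W_true = rng.normal(size=(p, d))
Y = X @ W_true + 0.05 * rng.normal(size=(n, d))

# Linear map trained with Adam on mean absolute error (MAE).
W = np.zeros((p, d))
m, v = np.zeros_like(W), np.zeros_like(W)
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
batch, epochs, t = 20, 30, 0
for epoch in range(epochs):
    order = rng.permutation(n)
    for i in range(0, n, batch):
        idx = order[i:i + batch]
        resid = X[idx] @ W - Y[idx]
        grad = X[idx].T @ np.sign(resid) / len(idx)  # subgradient of MAE
        t += 1
        m = b1 * m + (1 - b1) * grad          # first-moment estimate
        v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        W -= lr * m_hat / (np.sqrt(v_hat) + eps)

mae = np.abs(X @ W - Y).mean()  # final validation-style MAE
print(round(mae, 3))
```

Tracking `mae` after each epoch reproduces the kind of learning curve shown in the figures; the MLP and NN variants differ only in the model between X and Y.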
Accuracy of model predictions obtained from the RDPG-regression procedure. (The table reports area under the curve (AUC) values for each combination of visitor and place models, computed using mean absolute error as the cost function to measure the distance between the estimated and predicted latent feature spaces. The value in italics indicates the model with the highest AUC across all models. The values in brackets are 95% confidence intervals (CI).)
| | place: baseline | place: MLP | place: NN |
|---|---|---|---|
| visitor: baseline | 0.630 (95% CI [0.623, 0.625]) | 0.699 (95% CI [0.596, 0.6002]) | |
| visitor: MLP | 0.645 (95% CI [0.629, 0.634]) | *0.701* (95% CI [0.702, 0.715]) | 0.665 (95% CI [0.617, 0.621]) |
| visitor: NN | 0.653 (95% CI [0.630, 0.635]) | 0.699 (95% CI [0.619, 0.622]) | 0.670 (95% CI [0.605, 0.609]) |
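The AUC values above summarize how well predicted interaction scores rank held-out links above non-links. As a rough illustration of the metric (not the authors' evaluation code), AUC can be computed directly from its rank-statistic definition, the probability that a random positive outranks a random negative; the scores and labels below are invented.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic:
    the fraction of positive-negative pairs where the positive scores
    higher, counting ties as half."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-vs-negative pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Predicted interaction scores (e.g. dot products) vs. held-out link labels
labels = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.2])
print(auc(scores, labels))  # 8 of 9 pairs correctly ranked, i.e. about 0.889
```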