Yulia R Gel, Vyacheslav Lyubchich, L Leticia Ramirez Ramirez.
Abstract
We propose a new nonparametric bootstrap method for quantifying estimation uncertainty in functions of the network degree distribution in large, ultra-sparse networks. Both the network degree distribution and the network order are assumed to be unknown. The key idea is to adapt the "blocking" argument, developed for bootstrapping time series and re-tiling spatial data, to random networks. We first sample a set of multiple ego networks of varying orders that form a patch, the network analogue of a block, and then resample the data within patches. To select an optimal patch size, we develop a new computationally efficient, data-driven cross-validation algorithm. The proposed fast patchwork bootstrap (FPB) methodology extends ideas developed for the case of the network mean degree to inference on a degree distribution. In addition, the FPB is substantially less computationally expensive, requires less information on a graph, and is free from nuisance parameters. In our simulation study, the new bootstrap method yields sharper and better-calibrated confidence intervals for functions of a network degree distribution than competing approaches, including in the ultra-sparse regime. We illustrate the FPB with applications to collaboration networks in statistics and computer science and to Wikipedia networks.
Year: 2017; PMID: 28724937; PMCID: PMC5517433; DOI: 10.1038/s41598-017-05885-x
Source DB: PubMed; Journal: Sci Rep; ISSN: 2045-2322; Impact factor: 4.379
Figure 1. Steps of the LSMI algorithm with m = 2 seeds and d = 3 waves applied to a network of order n = 23.
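The roles of seeds and waves in the caption can be illustrated with a minimal snowball-sampling sketch. This record does not spell out LSMI's inclusion rules, so the rules below are assumptions: seeds are drawn without replacement, each undirected edge is traversed at most once per seed, and a node is recorded every time it is reached. The function names and the ring-shaped toy network of order 23 are hypothetical.

```python
import random

def snowball_waves(adj, seed, n_waves):
    """Grow one ego network from `seed`; returns a list of waves (wave 0 is the seed)."""
    waves = [[seed]]
    used_edges = set()  # undirected edges already traversed from this seed
    for _ in range(n_waves):
        next_wave = []
        for u in waves[-1]:
            for v in adj[u]:
                edge = frozenset((u, v))
                if edge in used_edges:
                    continue
                used_edges.add(edge)
                next_wave.append(v)  # assumed rule: v is recorded even if seen before
        if not next_wave:
            break
        waves.append(next_wave)
    return waves

def sample_patch(adj, n_seeds, n_waves, rng):
    """Draw seeds without replacement and grow one ego network per seed."""
    seeds = rng.sample(sorted(adj), n_seeds)
    return {s: snowball_waves(adj, s, n_waves) for s in seeds}

# Toy network: a ring of 23 nodes (hypothetical, only to mirror n = 23 in the caption).
n = 23
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
patch = sample_patch(ring, n_seeds=2, n_waves=3, rng=random.Random(0))
for seed, waves in patch.items():
    print(seed, waves)
```

On the ring, every seed yields one seed wave plus three waves of two nodes each, which makes the growth of a patch with the number of waves easy to inspect.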
Figure 2. Histograms of bootstrap mean degrees for a simulated network of order 10,000 with polylogarithmic(0.1, 2) degree distribution. The 95% confidence intervals (dashed vertical lines) are for μ(G) = 2.42 (solid vertical lines).
Figure 3. Theoretical degree distributions.
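The degree distributions named in this record can be tabulated numerically. The sketch below assumes the usual zero-truncated Poisson form and, for polylogarithmic(δ, λ), the form f(k) ∝ k^(−δ) e^(−k/λ) for k ≥ 1. Neither formula appears in this record, so both parametrizations and the function names are assumptions.

```python
import math

def zero_truncated_poisson_pmf(lam, k_max=200):
    """Poisson(lam) conditioned on k >= 1, truncated at k_max for computation."""
    weights = {k: math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
               for k in range(1, k_max + 1)}
    total = sum(weights.values())  # close to 1 - exp(-lam)
    return {k: w / total for k, w in weights.items()}

def polylog_pmf(delta, lam, k_max=500):
    """Assumed form: f(k) proportional to k**(-delta) * exp(-k / lam), k >= 1."""
    weights = {k: k ** (-delta) * math.exp(-k / lam) for k in range(1, k_max + 1)}
    total = sum(weights.values())  # numerical stand-in for the normalizing constant
    return {k: w / total for k, w in weights.items()}

def mean_degree(pmf):
    """Expected degree under a pmf given as {k: f(k)}."""
    return sum(k * p for k, p in pmf.items())

for name, pmf in [("Zero-truncated Poisson(2)", zero_truncated_poisson_pmf(2)),
                  ("polylogarithmic(0.1,2)", polylog_pmf(0.1, 2)),
                  ("polylogarithmic(2,3)", polylog_pmf(2, 3))]:
    print(name, round(mean_degree(pmf), 2), [round(pmf[k], 3) for k in range(2, 6)])
```

Under the assumed parametrization, the mean of polylogarithmic(0.1, 2) comes out at about 2.42, which agrees with the μ(G) = 2.42 reported for the simulated network in Figure 2. That agreement supports the assumed form but does not confirm it.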
Coverage of theoretical probabilities f(k) of observing a node of degree k, k = 2, …, 5, by 95% confidence intervals for varying network orders.
| Degree distribution | k | Method | Network order 2,000 | Network order 3,000 | Network order 5,000 | Network order 10,000 |
|---|---|---|---|---|---|---|
| Zero-truncated Poisson(2) | 2 | FPB | 92.4 (0.15) | 93.3 (0.15) | 93.7 (0.16) | 94.7 (0.15) |
| | | NCI{50} | 93.0 (0.25) | 92.6 (0.25) | 92.0 (0.26) | 93.2 (0.26) |
| | | QCI{50*} | 93.5 (0.25) | 94.5 (0.25) | 92.9 (0.25) | 94.4 (0.25) |
| | 3 | FPB | 94.9 (0.13) | 96.2 (0.13) | 92.9 (0.13) | 95.8 (0.13) |
| | | NCI{50} | 93.4 (0.22) | 95.3 (0.22) | 95.1 (0.22) | 94.6 (0.22) |
| | | QCI{50*} | 92.2 (0.22) | 93.2 (0.22) | 93.5 (0.22) | 93.0 (0.22) |
| | 4 | FPB | 96.4 (0.10) | 97.3 (0.10) | 97.9 (0.10) | 97.7 (0.10) |
| | | NCI{50} | 89.5 (0.17) | 88.7 (0.17) | 89.8 (0.17) | 89.8 (0.17) |
| | | QCI{50*} | 90.0 (0.16) | 89.6 (0.16) | 90.0 (0.16) | 89.1 (0.16) |
| | 5 | FPB | 98.0 (0.06) | 97.8 (0.06) | 98.9 (0.06) | 98.4 (0.06) |
| | | NCI{50} | 87.9 (0.10) | 86.7 (0.10) | 86.1 (0.10) | 85.9 (0.10) |
| | | QCI{50*} | 88.0 (0.09) | 86.7 (0.09) | 86.2 (0.09) | 85.9 (0.09) |
| polylogarithmic(0.1,2) | 2 | FPB | 92.2 (0.13) | 92.5 (0.13) | 92.3 (0.14) | 94.0 (0.13) |
| | | NCI{50} | 91.5 (0.23) | 90.8 (0.23) | 91.6 (0.23) | 91.8 (0.24) |
| | | QCI{50*} | 93.8 (0.23) | 93.6 (0.23) | 94.5 (0.23) | 93.5 (0.23) |
| | 3 | FPB | 92.5 (0.11) | 95.3 (0.11) | 96.6 (0.11) | 96.0 (0.11) |
| | | NCI{50} | 93.6 (0.19) | 91.7 (0.19) | 91.5 (0.19) | 91.7 (0.19) |
| | | QCI{50*} | 93.6 (0.19) | 91.7 (0.18) | 91.7 (0.18) | 92.9 (0.18) |
| | 4 | FPB | 93.9 (0.08) | 96.5 (0.08) | 96.7 (0.09) | 98.2 (0.09) |
| | | NCI{50} | 90.0 (0.14) | 91.4 (0.15) | 90.9 (0.15) | 93.4 (0.15) |
| | | QCI{50*} | 89.9 (0.14) | 91.4 (0.14) | 90.9 (0.14) | 93.3 (0.14) |
| | 5 | FPB | 97.3 (0.07) | 96.8 (0.06) | 98.0 (0.06) | 98.7 (0.07) |
| | | NCI{50} | 93.2 (0.11) | 92.1 (0.11) | 92.1 (0.11) | 91.8 (0.11) |
| | | QCI{50*} | 92.5 (0.10) | 91.6 (0.10) | 91.9 (0.10) | 91.5 (0.10) |
| polylogarithmic(2,3) | 2 | FPB | 96.0 (0.13) | 95.1 (0.13) | 95.8 (0.13) | 96.7 (0.14) |
| | | NCI{50} | 89.9 (0.19) | 92.7 (0.19) | 92.0 (0.19) | 90.7 (0.19) |
| | | QCI{50*} | 90.6 (0.18) | 93.3 (0.19) | 93.0 (0.18) | 92.7 (0.18) |
| | 3 | FPB | 96.0 (0.08) | 96.0 (0.08) | 98.6 (0.08) | 97.3 (0.08) |
| | | NCI{50} | 90.1 (0.10) | 89.2 (0.10) | 88.8 (0.10) | 90.5 (0.11) |
| | | QCI{50*} | 89.7 (0.10) | 88.7 (0.10) | 88.7 (0.10) | 89.7 (0.10) |
| | 4 | FPB | 96.8 (0.05) | 95.8 (0.05) | 95.6 (0.05) | 96.1 (0.05) |
| | | NCI{50} | 59.3 (0.06) | 58.7 (0.06) | 59.4 (0.06) | 60.6 (0.06) |
| | | QCI{50*} | 58.2 (0.05) | 57.1 (0.05) | 58.2 (0.05) | 59.5 (0.05) |
| | 5 | FPB | 86.7 (0.03) | 87.0 (0.03) | 86.8 (0.03) | 86.2 (0.03) |
| | | NCI{50} | 33.4 (0.03) | 34.8 (0.03) | 32.5 (0.03) | 34.2 (0.03) |
| | | QCI{50*} | 33.2 (0.02) | 34.8 (0.02) | 32.2 (0.02) | 34.0 (0.02) |
Average interval width is given in parentheses. Methods of obtaining confidence intervals are fast patchwork bootstrap (FPB), normal interval based on estimated proportions and their variance using 50 random nodes (NCI{50}), and bootstrap of 50 random nodes (QCI{50*}). Number of bootstrap resamples is 500. Number of Monte Carlo simulations is 1,000.
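One possible reading of the low NCI{50} and QCI{50*} coverage for polylogarithmic(2,3) at k = 4 and k = 5 is that a sample of 50 nodes often contains no node of that degree, in which case the interval collapses to zero width and cannot cover f(k). This explanation is not stated in this record. The sketch below checks it under two assumptions: f(k) ∝ k^(−δ) e^(−k/λ), and the 50 nodes behave like independent draws from f. The function names are hypothetical.

```python
import math

def polylog_pmf(delta, lam, k_max=500):
    """Assumed form: f(k) proportional to k**(-delta) * exp(-k / lam), k >= 1."""
    weights = {k: k ** (-delta) * math.exp(-k / lam) for k in range(1, k_max + 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def prob_degree_seen(f_k, n_nodes=50):
    """Chance that at least one of n_nodes independent draws has the given degree."""
    return 1.0 - (1.0 - f_k) ** n_nodes

pmf = polylog_pmf(2, 3)
seen = {k: prob_degree_seen(pmf[k]) for k in (4, 5)}
for k, p in seen.items():
    print(f"k={k}: f(k)={pmf[k]:.4f}, P(at least one node of degree k in 50)={p:.2f}")
```

Under these assumptions the probabilities are roughly 0.60 for k = 4 and 0.34 for k = 5, in the same range as the 57–61% and 32–35% coverage reported for the two 50-node methods in those rows.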
The 95% confidence intervals for the population probabilities f(k) of two collaboration networks.
| Network | k | f(k) | FPB lower | FPB upper | NCI{50} lower | NCI{50} upper | QCI{50*} lower | QCI{50*} upper |
|---|---|---|---|---|---|---|---|---|
| Co-authors in computer science | 1 | 0.136 | 0.093 | 0.197 | 0.104 | 0.336 | 0.100 | 0.340 |
| | 2 | 0.186 | 0.140 | 0.280 | 0.029 | 0.211 | 0.040 | 0.220 |
| | 3 | 0.157 | 0.087 | 0.187 | 0.057 | 0.263 | 0.080 | 0.271 |
| | 4 | 0.111 | 0.066 | 0.142 | 0.000 | 0.095 | 0.000 | 0.100 |
| | 5 | 0.081 | 0.044 | 0.115 | 0.004 | 0.156 | 0.020 | 0.160 |
| Co-authors in statistics | 1 | 0.264 | 0.218 | 0.354 | 0.089 | 0.311 | 0.100 | 0.300 |
| | 2 | 0.292 | 0.135 | 0.311 | 0.227 | 0.493 | 0.240 | 0.520 |
| | 3 | 0.162 | 0.081 | 0.243 | 0.030 | 0.210 | 0.040 | 0.200 |
| | 4 | 0.088 | 0.014 | 0.122 | 0.000 | 0.059 | 0.000 | 0.060 |
| | 5 | 0.055 | 0.037 | 0.148 | 0.030 | 0.210 | 0.040 | 0.220 |
Methods of obtaining confidence intervals are fast patchwork bootstrap (FPB), normal interval based on estimated proportions and their variance using 50 random nodes (NCI{50}), and bootstrap of 50 random nodes (QCI{50*}). In FPB, 12 seed-wave combinations were considered: waves from 1 to 3, seeds 20, 30, 40, and 50. Cross-validation is based on a random selection of 100 seeds 13 times. Number of bootstrap resamples is 500.
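The two 50-node baselines described in the note above can be sketched as follows. The record gives only their verbal description, so the details are assumptions: the normal interval uses the sample variance with an n − 1 denominator and is clipped to [0, 1], and the bootstrap interval takes the 2.5% and 97.5% quantiles of the resampled proportions. The function names are hypothetical.

```python
import random
from statistics import NormalDist, quantiles

def normal_ci(p_hat, n, level=0.95):
    """Normal-approximation interval for a proportion estimated from n nodes."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * (p_hat * (1 - p_hat) / (n - 1)) ** 0.5  # assumed n - 1 denominator
    return max(0.0, p_hat - half), min(1.0, p_hat + half)  # assumed clipping to [0, 1]

def quantile_ci(degrees, degree, n_boot=500, rng=None):
    """95% percentile-bootstrap interval for the share of nodes with a given degree."""
    rng = rng or random.Random()
    n = len(degrees)
    props = [sum(d == degree for d in rng.choices(degrees, k=n)) / n
             for _ in range(n_boot)]
    cuts = quantiles(props, n=40)  # 39 cut points: first is 2.5%, last is 97.5%
    return cuts[0], cuts[-1]

# Hypothetical sample: 11 of 50 nodes have degree 1, so p_hat = 0.22.
sample = [1] * 11 + [2] * 39
lo_n, hi_n = normal_ci(0.22, 50)
lo_q, hi_q = quantile_ci(sample, 1, rng=random.Random(1))
print(round(lo_n, 3), round(hi_n, 3))
print(round(lo_q, 3), round(hi_q, 3))
```

With p̂ = 0.22, a value inferred from the interval midpoint rather than stated in the record, the normal interval rounds to (0.104, 0.336), which matches the NCI{50} entry for k = 1 in the computer science network. The n − 1 denominator and the clipping were chosen because they reproduce the rounded NCI{50} entries, including the zero lower bounds at k = 4.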
Figure 4. Observed frequencies (points) and FPB 95% intervals (lines) for f(k), for the two networks of researchers.
The 95% FPB confidence intervals for the mean degrees of Wikipedia networks, constructed based on the links (edges) between talk and user pages (nodes) in different languages.
| Network | | Mean degree | Optimal seeds | Optimal waves | Lower 95% bound | Upper 95% bound |
|---|---|---|---|---|---|---|
| Hebrew | 7,856,666 | 5.90 | 40 | 1 | 2.70 | 9.82 |
| Italian | 25,951,119 | 9.59 | 20 | 1 | 7.97 | 17.56 |
| Norwegian | 3,824,079 | 4.16 | 50 | 1 | 1.33 | 6.11 |
| Russian | 19,415,432 | 8.47 | 40 | 1 | 6.26 | 15.33 |
The analysis considered 12 seed-wave combinations (seeds 20, 30, 40, and 50; waves from 1 to 3), with 500 bootstrap resamples for each combination. Cross-validation is based on a random selection of 10 seeds 10 times.
Figure 5. Estimated mean degrees of the Wikipedia networks in Hebrew, Italian, Norwegian, and Russian vs. percent of people in corresponding countries who can speak English.