Alexey Miroshnikov, Erin M. Conlon.
Abstract
Recent advances in big data and analytics research have provided a wealth of large data sets that are too big to be analyzed in their entirety, due to restrictions on computer memory or storage size. New Bayesian methods have been developed for data sets that are large only due to large sample sizes. These methods partition big data sets into subsets and perform independent Bayesian Markov chain Monte Carlo (MCMC) analyses on the subsets. The methods then combine the independent subset posterior samples to estimate a posterior density given the full data set. These approaches were shown to be effective for Bayesian models including logistic regression models, Gaussian mixture models and hierarchical models. Here, we introduce the R package parallelMCMCcombine, which carries out four of these techniques for combining independent subset posterior samples. We illustrate each of the methods using a Bayesian logistic regression model for simulation data and a Bayesian Gamma model for real data; we also demonstrate features and capabilities of the R package. The package assumes the user has carried out the Bayesian analysis and has produced the independent subposterior samples outside of the package. The methods are primarily suited to models with unknown parameters of fixed dimension that exist in continuous parameter spaces. We envision this tool will allow researchers to explore the various methods for their specific applications and will assist future progress in this rapidly developing field.
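To make the combining step concrete, here is a minimal NumPy sketch of a consensus Monte Carlo covariance-weighted combination, one of the four techniques named in the abstract. The d × T × M array layout, variable names, and synthetic draws are illustrative assumptions, not the package's R interface:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, M = 2, 1000, 5  # model parameters, MCMC draws, data subsets

# Synthetic subposterior draws, one (d x T) chain per subset.
sub = rng.normal(loc=0.0, scale=1.0, size=(d, T, M))

# Covariance-weighted consensus: weight each subset's draws by the
# inverse of its sample covariance matrix. (The independence variant
# would use only the diagonal, i.e., inverse marginal variances.)
W = [np.linalg.inv(np.cov(sub[:, :, m])) for m in range(M)]
W_sum_inv = np.linalg.inv(sum(W))

combined = np.empty((d, T))
for t in range(T):
    combined[:, t] = W_sum_inv @ sum(W[m] @ sub[:, t, m] for m in range(M))
```

The combined draws approximate samples from the full-data posterior when the subposteriors are roughly Gaussian, which is the regime these weighted-average methods target.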
Year: 2014 PMID: 25259608 PMCID: PMC4178156 DOI: 10.1371/journal.pone.0108425
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Results for the simulation data of the Bayesian logistic regression model, for the marginal density of the parameter.
(a) full data posterior density and 10 subposterior densities for the 10 data subsets; (b)-(f): full data and estimated combined posterior densities for: (b) the sample average method; (c) the consensus Monte Carlo independence method; (d) the consensus Monte Carlo covariance method; (e) the semiparametric density product estimator method, with default settings of the function; (f) the semiparametric density product estimator method with the same settings as (e) except the argument anneal = FALSE. The consensus Monte Carlo covariance method produces the smallest relative L distance (see Table 1).
Table 1. Estimated relative L distances.

| Subposterior Samples Combining Method | Logistic Regression Model | Gamma Model |
| --- | --- | --- |
| Sample average | 0.034 | 0.024 |
| Consensus Monte Carlo, independence | 0.024 | 0.020 |
| Consensus Monte Carlo, covariance | 0.015 | 0.016 |
| Semiparametric density product estimator (default, anneal = TRUE) | 0.046 | 0.022 |
| Semiparametric density product estimator (anneal = FALSE) | 0.020 | 0.021 |

Estimated relative L distances for each of the methods of combining subposterior samples to estimate posterior densities given the full data set. Results are included for the Bayesian logistic regression model with the simulated data set, for the marginal densities of the parameter, and for the Bayesian Gamma model with the real airlines data set, for the marginal densities of the parameter.
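The extract does not spell out how the relative L distance is computed. Assuming it is a relative L2 distance between kernel density estimates of the combined and full-data marginal posteriors, a minimal sketch (function name, bandwidth rule, and grid are all assumptions):

```python
import numpy as np

def rel_l2_distance(samples_est, samples_full, grid):
    """Relative L2 distance between Gaussian-KDE density estimates
    of two sample sets, evaluated on a uniform grid."""
    def kde(x):
        h = 1.06 * x.std() * len(x) ** (-0.2)  # Silverman's rule of thumb
        z = (grid[:, None] - x[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

    p_est, p_full = kde(np.asarray(samples_est)), kde(np.asarray(samples_full))
    dx = grid[1] - grid[0]  # uniform grid spacing assumed
    num = np.sqrt(((p_est - p_full) ** 2).sum() * dx)
    den = np.sqrt((p_full ** 2).sum() * dx)
    return num / den
```

A perfect combination would give a distance of zero; the small values in Table 1 indicate the estimated combined densities track the full-data posterior closely.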
Figure 2. Results for the real airlines data of the Bayesian Gamma model, for the marginal density of the parameter.
(a) full data posterior density and 5 subposterior densities for the 5 data subsets; (b)-(f): full data and estimated combined posterior densities for: (b) the sample average method; (c) the consensus Monte Carlo independence method; (d) the consensus Monte Carlo covariance method; (e) the semiparametric density product estimator method, with default settings of the function; (f) the semiparametric density product estimator method with the same settings as (e) except the argument anneal = FALSE. The consensus Monte Carlo covariance method produces the smallest relative L distance (see Table 1).
Table 2. Computational time (in seconds) for the four combining methods.

| Number of Subsets | Combining Method | d = 2 | d = 5 | d = 10 | d = 50 |
| --- | --- | --- | --- | --- | --- |
| M = 5 | Sample average | 0.06 (0.04) | 0.09 | 0.13 | 0.48 |
| | Consensus Monte Carlo, independence | 2 (2) | 2 | 2 | 3 |
| | Consensus Monte Carlo, covariance | 2 (2) | 2 | 3 | 23 |
| | Semiparametric density product estimator | 401 (402) | 432 | 464 | 1136 |
| M = 10 | Sample average | 0.06 | 0.10 (0.12) | 0.20 | 0.89 |
| | Consensus Monte Carlo, independence | 2 | 2 (2) | 2 | 4 |
| | Consensus Monte Carlo, covariance | 4 | 4 (4) | 5 | 36 |
| | Semiparametric density product estimator | 795 | 816 (820) | 880 | 2119 |
| M = 20 | Sample average | 0.08 | 0.16 | 0.31 | 1 |
| | Consensus Monte Carlo, independence | 2 | 2 | 2 | 7 |
| | Consensus Monte Carlo, covariance | 6 | 7 | 10 | 70 |
| | Semiparametric density product estimator | 1540 | 1602 | 1729 | 4102 |
| M = 100 | Sample average | 0.34 | 0.72 | 1 | 7 |
| | Consensus Monte Carlo, independence | 3 | 4 | 5 | 27 |
| | Consensus Monte Carlo, covariance | 29 | 35 | 47 | 343 |
| | Semiparametric density product estimator | 7522 | 8015 | 8675 | 22540 |

Computational times, in seconds (rounded, unless less than 1 second), for the four methods of the R package parallelMCMCcombine, using simulation data and T = 50,000 MCMC samples; d is the number of model parameters and M the number of subsets. The values in parentheses are for our example data sets: d = 2, M = 5 corresponds to the Gamma model, and d = 5, M = 10 to the logistic regression model. The results are based on a computer running Windows 7 with an Intel Celeron 1007U 1.5 GHz processor.
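The scaling pattern in Table 2 — sample averaging nearly free, the semiparametric estimator orders of magnitude slower — reflects the work done per draw: the sample-average combine is a single pass over the d × T × M array. A toy timing sketch on synthetic draws (array layout and names are assumptions, not the package's API):

```python
import time
import numpy as np

rng = np.random.default_rng(1)
d, T, M = 5, 50_000, 10  # matches one Table 2 configuration
sub = rng.normal(size=(d, T, M))  # synthetic subposterior draws

t0 = time.perf_counter()
avg = sub.mean(axis=2)  # sample-average combine: one O(d*T*M) reduction
elapsed = time.perf_counter() - t0
```

By contrast, the semiparametric density product estimator performs kernel-density computations across subsets at every draw, which is why its column grows fastest with both M and d.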
Table 3. Computational time (in minutes) for producing the MCMC samples.

| Number of Subsets | Data Points Per Subset (Logistic) | Time Per Subset (Logistic) | Total Time (Logistic) | Data Points Per Subset (Gamma) | Time Per Subset (Gamma) | Total Time (Gamma) |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 20,000 | 174 | 870 | 65,981 | 256 | 1,280 |
| 10 | 10,000 | 85 | 850 | 32,991 | 139 | 1,390 |
| 20 | 5,000 | 41 | 820 | 16,496 | 65 | 1,300 |
| 1 (full data set) | 100,000 | 954 | 954 | 329,905 | 1,397 | 1,397 |

Average computational times per subset, in minutes (rounded), for producing T = 52,000 MCMC samples (including burn-in) for the Bayesian logistic regression and Bayesian Gamma data examples, and total computational times. The results are based on the WinBUGS software program and a computer running Windows 7 with an Intel Core i7-4600U 2.1 GHz processor. Note that the R package parallelMCMCcombine is not used to create these samples; the MCMC samples are used as input to the parallelMCMCcombine package.
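The per-subset sizes in Table 3 are consistent with an even partition of the full data across subsets (e.g., 100,000 / 5 = 20,000 for the logistic model). A minimal sketch of that partitioning step, which the user performs before running the independent subset MCMC analyses (all names hypothetical):

```python
import numpy as np

n, M = 100_000, 5  # full-data size and subset count, as in Table 3
rng = np.random.default_rng(2)

indices = rng.permutation(n)          # shuffle so subsets are exchangeable
subsets = np.array_split(indices, M)  # one index set per parallel MCMC run
```

Each index set would then drive one independent MCMC analysis (here run in WinBUGS), and the resulting chains become the subposterior-sample array passed to parallelMCMCcombine.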