Markus Hainy, David J Price, Olivier Restif, Christopher Drovandi.
Abstract
Performing optimal Bayesian design for discriminating between competing models is computationally intensive as it involves estimating posterior model probabilities for thousands of simulated data sets. This issue is compounded further when the likelihood functions for the rival models are computationally expensive. A new approach using supervised classification methods is developed to perform Bayesian optimal model discrimination design. This approach requires considerably fewer simulations from the candidate models than previous approaches using approximate Bayesian computation. Further, it is easy to assess the performance of the optimal design through the misclassification error rate. The approach is particularly useful in the presence of models with intractable likelihoods but can also provide computational advantages when the likelihoods are manageable. Supplementary Information: The online version contains supplementary material available at 10.1007/s11222-022-10078-2.
Keywords: Approximate Bayesian computation; Bayesian model selection; Classification and regression tree; Continuous-time Markov process; Random forest; Simulation-based Bayesian experimental design
Year: 2022 PMID: 35310544 PMCID: PMC8924111 DOI: 10.1007/s11222-022-10078-2
Source DB: PubMed Journal: Stat Comput ISSN: 0960-3174 Impact factor: 2.324
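The classification idea summarised in the abstract can be illustrated with a minimal sketch: simulate labelled data from each candidate model at a candidate design, train a classifier to recover the model label, and use the held-out misclassification error rate as the design criterion. Everything specific below is an assumption made for illustration and is not taken from the paper: the two toy models, the log-normal prior, the population size, the candidate design grid, and the depth-1 tree (`fit_stump`) standing in for the classification trees and random forests used by the authors.

```python
import math
import random

def simulate(model, t, rng, pop=50):
    """Toy simulators (hypothetical, not the paper's models).

    Returns the number of infected individuals out of `pop` at time t.
    Model 0: one exponential stage before infection.
    Model 1: two exponential stages (an extra 'exposed' stage).
    """
    b = rng.lognormvariate(0.0, 0.5)  # assumed prior on the rate
    if model == 0:
        p = 1.0 - math.exp(-b * t)
    else:
        p = 1.0 - math.exp(-b * t) * (1.0 + b * t)
    return sum(rng.random() < p for _ in range(pop))

def predict(y, cut, left):
    """Depth-1 tree: class `left` if y <= cut, otherwise the other class."""
    return left if y <= cut else 1 - left

def fit_stump(ys, labels):
    """Choose the cut point and orientation with the fewest training errors."""
    best = None
    for cut in sorted(set(ys)):
        for left in (0, 1):
            errors = sum(predict(y, cut, left) != m for y, m in zip(ys, labels))
            if best is None or errors < best[0]:
                best = (errors, cut, left)
    return best[1], best[2]

def error_rate(t, n_sim=2000, seed=1):
    """Held-out misclassification error rate at observation time t."""
    rng = random.Random(seed)

    def draw(n):
        labels = [rng.randrange(2) for _ in range(n)]  # equal prior model probabilities
        return [simulate(m, t, rng) for m in labels], labels

    train_y, train_m = draw(n_sim)
    test_y, test_m = draw(n_sim)
    cut, left = fit_stump(train_y, train_m)
    wrong = sum(predict(y, cut, left) != m for y, m in zip(test_y, test_m))
    return wrong / n_sim

designs = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]  # hypothetical candidate observation times
losses = {t: error_rate(t) for t in designs}
best_design = min(losses, key=losses.get)
print(losses, best_design)
```

A grid search over a single observation time keeps the sketch short; designs with several observation times would need a multivariate classifier and a proper optimisation routine.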
Four competing models considered in the infectious disease example of Section 4.1
| Model | Event type | Update | Rate |
|---|---|---|---|
| 1 | Infected | | |
| 2 | Infected | | |
| 3 | Exposed | | |
| | Infected | | |
| 4 | Exposed | | |
| | Infected | | |
Fig. 1 Plots of the approximated expected loss functions produced by the tree classification approach with cross-validation (solid), the random forest classification approach using out-of-bag class predictions (dotted), and the ABC approach (dashed) under the 0–1 loss (thick lines) and multinomial deviance loss (thin lines) for the infectious disease example. For easier comparison, the expected losses have been scaled by dividing by the maximum loss
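The two loss functions named in the Fig. 1 caption can be sketched as follows. The definitions are assumptions based on common usage rather than formulas quoted from the paper: the 0–1 loss is 1 when the most probable model is not the true one, and the multinomial deviance loss is taken to be minus twice the log of the probability assigned to the true model. The function names and the clipping constant `eps` are hypothetical.

```python
import math

def zero_one_loss(probs, true_model):
    """1 if the most probable model differs from the true model, else 0."""
    predicted = max(range(len(probs)), key=lambda m: probs[m])
    return 0 if predicted == true_model else 1

def multinomial_deviance_loss(probs, true_model, eps=1e-12):
    """-2 * log of the probability given to the true model (clipped at eps)."""
    return -2.0 * math.log(max(probs[true_model], eps))

def scale_by_max(losses):
    """Divide each expected-loss estimate by the largest one, as in the caption."""
    top = max(losses)
    return [x / top for x in losses]

# Estimated model probabilities for one simulated data set with four models.
probs = [0.1, 0.6, 0.2, 0.1]
print(zero_one_loss(probs, 1), zero_one_loss(probs, 2))
print(multinomial_deviance_loss(probs, 1))
print(scale_by_max([0.2, 0.4]))
```

Averaging either loss over many simulated data sets at a fixed design gives an estimate of the expected loss at that design.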
Optimal designs obtained by tree classification (cross-validated), random forest classification (using out-of-bag class predictions), and ABC approaches under the 0–1 loss (01L) or multinomial deviance loss (MDL) for 1, 2, and 3 observations for the infectious disease example. The equidistant designs are also shown
| Method/Loss | 1 obs. | 2 obs. (1st) | 2 obs. (2nd) | 3 obs. (1st) | 3 obs. (2nd) | 3 obs. (3rd) |
|---|---|---|---|---|---|---|
| Tree 01L | 0.598 | 0.787 | 4.437 | 0.818 | 4.568 | 9.493 |
| RF 01L | 0.611 | 0.823 | 4.433 | 0.750 | 4.000 | 10.000 |
| ABC 01L | 0.597 | 0.877 | 4.357 | 0.750 | 2.250 | 5.750 |
| Tree MDL | 0.621 | 0.750 | 4.750 | 0.750 | 4.750 | 10.000 |
| RF MDL | 0.633 | 0.750 | 4.750 | 0.750 | 4.500 | 9.000 |
| ABC MDL | 0.556 | 0.750 | 3.500 | 0.500 | 1.750 | 4.750 |
| Equidistant | 5.000 | 3.333 | 6.667 | 2.500 | 5.000 | 7.500 |
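The equidistant row of the table is consistent with placing the observation times evenly in the interior of a time interval of length 10. The sketch below reproduces those values; the upper limit of 10 and the function name `equidistant_design` are assumptions inferred from the table, not details stated in this excerpt.

```python
def equidistant_design(n, horizon=10.0):
    """n equally spaced interior points of (0, horizon); horizon is assumed."""
    return [horizon * i / (n + 1) for i in range(1, n + 1)]

for n in (1, 2, 3):
    print(n, [round(t, 3) for t in equidistant_design(n)])
# 1 [5.0]
# 2 [3.333, 6.667]
# 3 [2.5, 5.0, 7.5]
```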
Average misclassification error rates for optimal designs obtained by tree classification (cross-validated), random forest classification (using out-of-bag class predictions), and ABC approaches under the 0–1 loss (01L) or multinomial deviance loss (MDL) as well as for the equidistant designs for the infectious disease example. The average misclassification error rates were calculated by repeating the random forest classification procedure 100 times (see text) and taking the average. The standard deviations are given in parentheses
| Design | 1 obs. | 2 obs. | 3 obs. | 4 obs. | 5 obs. |
|---|---|---|---|---|---|
| Tree 01L | 0.5554 | 0.5158 | 0.5133 | 0.5116 | 0.5129 |
| | (0.0023) | (0.0024) | (0.0026) | (0.0022) | (0.0025) |
| RF 01L | 0.5548 | 0.5160 | 0.5132 | 0.5113 | 0.5138 |
| | (0.0027) | (0.0025) | (0.0025) | (0.0025) | (0.0025) |
| ABC 01L | 0.5547 | 0.5161 | 0.5196 | 0.5046 | 0.5339 |
| | (0.0023) | (0.0025) | (0.0030) | (0.0024) | (0.0027) |
| Tree MDL | 0.5547 | 0.5178 | 0.5152 | 0.5183 | 0.5159 |
| | (0.0023) | (0.0030) | (0.0028) | (0.0028) | (0.0025) |
| RF MDL | 0.5550 | 0.5179 | 0.5118 | 0.5104 | 0.5128 |
| | (0.0022) | (0.0026) | (0.0026) | (0.0026) | (0.0026) |
| ABC MDL | 0.5553 | 0.5221 | 0.5216 | 0.5226 | 0.5416 |
| | (0.0020) | (0.0028) | (0.0028) | (0.0029) | (0.0025) |
| Equidistant | 0.6592 | 0.6200 | 0.5760 | 0.5537 | 0.5519 |
| | (0.0029) | (0.0025) | (0.0027) | (0.0029) | (0.0032) |
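Each table entry is an average over repeated runs of the classification procedure, with the standard deviation in parentheses. A minimal sketch of that summary step is given below. The helper names are hypothetical, `noisy_error_rate` is only a stand-in for one full train-and-test run of a classifier, and the use of the sample standard deviation is an assumption, since the excerpt does not say which estimator was used.

```python
import random
import statistics

def summarise_repetitions(estimate_once, n_rep=100, seed=0):
    """Run the estimator n_rep times; return the mean and sample standard deviation."""
    rng = random.Random(seed)
    rates = [estimate_once(rng) for _ in range(n_rep)]
    return statistics.mean(rates), statistics.stdev(rates)

def noisy_error_rate(rng, true_rate=0.5, n_test=2000):
    """Stand-in for one classification run: share of n_test cases misclassified."""
    return sum(rng.random() < true_rate for _ in range(n_test)) / n_test

mean_rate, sd_rate = summarise_repetitions(noisy_error_rate, n_rep=100)
print(f"{mean_rate:.4f} ({sd_rate:.4f})")
```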
Fig. 2 Distribution of estimated ABC posterior model probabilities of the correct model for each of the optimal designs obtained by the different approaches for 1–5 observations in the infectious disease example. Each distribution is taken over 200 process realisations (50 from each of the four models) simulated from the prior predictive distribution at the respective optimal design. For each number of design points, from left to right there are two magenta box plots for the cross-validated tree classification designs, two blue box plots for the random forest classification designs, two red box plots for the ABC classification designs, and one cyan box plot for the equispaced design. Box plots for the 0–1 loss and for the equispaced designs do not have a notch, whereas box plots for the multinomial deviance loss are notched
Fig. 3 Misclassification error rates computed using random forest classification with training and test samples of size 20K, averaged over 100 repetitions of the classification procedure (left column), and misclassification error rates computed using the Gauss–Hermite quadrature approximation to the marginal likelihood over 2K prior predictive simulations (right column), evaluated at various optimal designs for different methods (in the rows) for the infectious disease example with two models. The total number of observations is plotted on the x-axis of each graph. Each line connects the observed values of the misclassification error rate as the number of realisations q increases for a particular value of
Average misclassification error rates for the optimal designs obtained under the classification approaches using trees or random forests and for the equispaced designs for the macrophage model. The average misclassification error rates were calculated by repeating the random forest classification procedure 100 times and taking the average. The standard deviations are given in parentheses
| Design | 1 obs. | 2 obs. | 3 obs. | 4 obs. | 5 obs. |
|---|---|---|---|---|---|
| Tree | 0.1928 | 0.1323 | 0.1433 | 0.1469 | 0.1483 |
| | (0.0024) | (0.0021) | (0.0022) | (0.0019) | (0.0022) |
| RF | 0.1925 | 0.1325 | 0.1408 | 0.1410 | 0.1465 |
| | (0.0022) | (0.0021) | (0.0022) | (0.0021) | (0.0021) |
| Equi | 0.2442 | 0.1974 | 0.1928 | 0.1912 | 0.1935 |
| | (0.0027) | (0.0023) | (0.0023) | (0.0025) | (0.0024) |
Fig. 4 Distribution of estimated posterior model probabilities of the correct model for each of the optimal designs obtained by the different approaches for 1–5 observations in the macrophage example. Each distribution is taken over 80 process realisations (20 from each of the three models) simulated from the prior predictive distribution at the respective optimal design. For each number of design points, the magenta box plot on the left-hand side is for the tree classification design, the notched blue box plot in the middle is for the random forest classification design (rf), and the red box plot on the right-hand side is for the equispaced design