| Literature DB >> 34752460 |
Abstract
A comprehensive methodology for semiparametric probability density estimation is introduced and explored. The probability density is modelled by sequences of mostly regular or steep exponential families generated by flexible sets of basis functions, possibly including boundary terms. Parameters are estimated by global maximum likelihood without any roughness penalty. A statistically orthogonal formulation of the inference problem and a numerically stable and fast convex optimization algorithm for its solution are presented. Automatic model selection over the type and number of basis functions is performed with the Bayesian information criterion. The methodology can naturally be applied to densities supported on bounded, infinite or semi-infinite domains without boundary bias. Relationships to the truncated moment problem and the moment-constrained maximum entropy principle are discussed and a new theorem on the existence of solutions is contributed. The new technique compares very favourably to kernel density estimation, the diffusion estimator, finite mixture models and local likelihood density estimation across a diverse range of simulation and observation data sets. The semiparametric estimator combines a very small mean integrated squared error with a high degree of smoothness which allows for a robust and reliable detection of the modality of the probability density in terms of the number of modes and bumps.Entities:
Mesh:
Year: 2021 PMID: 34752460 PMCID: PMC8577774 DOI: 10.1371/journal.pone.0259111
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Overview of domains and basis functions for semiparametric density estimation.
| Domain | Rescaling | Basis functions { |
|---|---|---|
|
|
| |
|
| ||
|
| ||
|
|
| |
|
| ||
|
| ||
|
| { | |
| splines on [ | ||
|
| { | |
| even splines on [ | ||
|
|
| |
|
|
Overview of density estimators.
See S3 Appendix for details.
| Acronym | Description |
|---|---|
| KDE1 | Gaussian kernel density estimator with rule-of-thumb bandwidth [ |
| reflection method applied at any boundary point | |
| KDE2 | MATLAB in-built Gaussian kernel density estimator with default bandwidth; reflection method applied at any boundary point |
| KDE3 | Gaussian kernel density estimator with plug-in bandwidth proposed in [ |
| DE0 | Diffusion estimator with |
| DE1 | Diffusion estimator with |
| BMM | Beta mixture model [ |
| GMM | Gaussian mixture model [ |
| GaMM | Gamma mixture model [ |
| WMM | Weibull mixture model [ |
| POL | Exponential families generated by (linearly extrapolated) polynomials; optional boundary terms; model selection by BIC |
| SPL1 | Exponential families generated by (linearly extrapolated) splines; equidistant knots; knot deletion; optional boundary terms; model selection by BIC |
| SPL2 | Exponential families generated by (linearly extrapolated) splines; equidistant knots with respect to the empirical distribution; knot deletion; optional boundary terms; model selection by BIC |
| TRIG | Exponential families generated by trigonometric functions; optional boundary terms; model selection by BIC |
| SPDE | Best model out of POL, SPL1, SPL2 and TRIG selected by BIC |
| LLDE0 | Local likelihood density estimator [ |
| LLDE1 | Local likelihood density estimator [ |
| LLDE2 | Local likelihood density estimator [ |
Test cases of probability densities: Densities 1–12 are supported on (−∞, ∞), density 13 on [−2, 2], density 14 on [0, 1] and density 15 on [0, ∞).
N(μ, σ2) denotes the normal density with mean μ and standard deviation σ, Beta(α, β) denotes the beta density with shape parameters α and β, and Gam(α, β) denotes the gamma density with shape parameter α and rate parameter β.
| Test case | Description | Probability density function |
|---|---|---|
| 1 | normal | |
| 2 | strongly skewed unimodal |
|
| 3 | kurtotic unimodal |
|
| 4 | outlier unimodal |
|
| 5 | unimodal, 2 bumps |
|
| 6 | skewed bimodal |
|
| 7 | separated bimodal |
|
| 8 | asymmetric bimodal |
|
| 9 | trimodal |
|
| 10 | 5 modes |
|
| 11 | claw |
|
| 12 | asymmetric claw |
|
| 13 | skewed bimodal | Test case 6 truncated to [−2, 2] |
| 14 | asymmetric bimodal |
|
| 15 | asymmetric bimodal |
|
Mean integrated squared error (scaled by 104) estimated as the average over 100 realisations for the test cases in Table 3.
For test cases 14 and 15, the semiparametric estimators allow for logarithmic boundary terms at any boundary point. The best method is indicated in bold. For test cases 9 and 10, the diffusion estimators are unstable; for test cases 4, 8 and 11, they give reasonable density estimates in most realisations but are unstable in some realisations. See S3 Appendix for further explanation.
| Test case |
| KDE1/2/3 | DE0/1 | POL | SPL1/2 | TRIG | SPDE | LLDE0/1/2 |
|---|---|---|---|---|---|---|---|---|
| 1 | 500 | 17/16/17 | 16/11 |
| 8.3/8.4 |
| 18/10/8.1 | |
| 2 | 1000 | 750/632/104 | 247/203 | 131 | 158/82 |
| 218/249/152 | |
| 2 | 5000 | 475/372/36 | 54/41 | 77 | 71/ |
| 81/75/58 | |
| 3 | 1000 | 304/417/80 | 160/109 | 431 | 142/ |
| 304/289/159 | |
| 3 | 5000 | 123/187/33 | 60/37 | 246 | 32/25 |
| 109/94/45 | |
| 4 | 1000 | 82/79/108 | (400)/(236) | 82 | 81 | 1028/503/108 | ||
| 4 | 5000 | (112)/(77) | 51 | 31/93 | 31 | 226/97/32 | ||
| 5 | 1000 | 8.3/9.7/8.9 | 8.8/ | 13 | 10/10 | 8.9 | 10/11/10 | |
| 5 | 10000 | 1.5/1.9/1.6 | 2.3/ | 2.1 | 2.8/1.8 | 1.8 | 2.2/2.9/2.2 | |
| 6 | 1000 | 24/45/20 | 20/ | 33 | 33/22 | 20 | 26/31/24 | |
| 6 | 10000 | 5.0/12/3.2 | 4.5/ | 5.1 | 4.0/4.9 | 3.6 | 5.1/6.4/5.5 | |
| 7 | 500 | 256/592/35 | 32/ | 38 | 36/89 | 37 | 39/ | |
| 7 | 5000 | 68/280/6.1 | 8.1/4.4 | 6.0 |
| 7.7/4.9/4.5 | ||
| 8 | 1000 | 1623/1735/52 | (513)/(458) | 38 | 35/71 | 35 | 93/86/ | |
| 8 | 5000 | 1134/1322/14 | (93)/(51) | 7.4 | 4.5/25 | 4.5 | 28/27/ | |
| 9 | 1000 | 322/373/8.9 |
| 7.5/11 | 3.8 | 36/24/4.0 | ||
| 9 | 10000 | 282/343/1.8 |
| 0.9/3.0 | 0.5 | 3.7/4.0/0.4 | ||
| 10 | 1000 | 219/227/6.8 |
| 14/10 |
| 8.9/9.4/3.1 | ||
| 10 | 10000 | 196/208/1.2 |
| 1.7/2.6 |
| 2.0/2.2/ | ||
| 11 | 1000 | 393/436/ | (258)/(183) | 511 | 118/89 | 86 | 147/170/104 | |
| 11 | 10000 | 206/293/ | (79)/(53) | 396 | 154/20 | 20 | 24/23/14 | |
| 12 | 1000 | 159/202/76 | 118/80 | 129 | 104/170 | 104 | 84/80/ | |
| 12 | 10000 | 86/122/15 | 53/39 | 61 | 51/57 | 51 | 17/16/ | |
| 13 | 1000 | 29/50/25 | 25/27 | 38 | 31/ | 29 | 26 | 130/77/103 |
| 13 | 10000 | 6.8/16/4.1 | 3.7/3.6 | 5.5 | 4.3 |
| 28/18/28 | |
| 14 | 1000 | 155/271/128 | 138/138 | 80 | 74/78 |
| 70 | 218/161/280 |
| 14 | 5000 | 63/138/37 | 43/39 | 22 | 21/ | 27 | 26 | 96/46/73 |
| 15 | 1000 | 126/238/ | 54/67 | 95 | 61/62 | 58 | 80/68/80 | |
| 15 | 5000 | 56/125/ | 19/19 | 69 | 25/17 | 16 | 31/28/22 |
Detection of modality: Number of realisations out of 100 in which the modality of the density is correctly identified.
The upper row refers to modes, the lower row to bumps. Modes and bumps are counted on the indicated interval. For test cases 14 and 15, the semiparametric estimators allow for logarithmic boundary terms at any boundary point. The best method is indicated in bold. For test case 8, the diffusion estimators give reasonable density estimates in most realisations but are unstable in some realisations. See S3 Appendix for further explanation.
| Test case |
| Interval | KDE1/2/3 | DE0/1 | POL | SPL1/2 | TRIG | SPDE | LLDE0/1/2 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 500 | [−5.5, 5.5] | 56/75/34 | 85/95 |
|
| 82/18/65 | ||
| 0/0/4 | 0/3 |
|
| 0/0/65 | |||||
| 1 | 500 | [−3.5, 3.5] | 72/89/94 | 86/99 |
|
| 89/98/98 | ||
| 0/0/20 | 8/75 |
|
| 30/84/94 | |||||
| 2 | 1000 | [−4, 4] | 16/4/0 | 0/0 | 3 | 89/ |
| 0/0/0 | |
| 0/0/0 | 0/0 | 0 | 59/ | 76 | 0/0/0 | ||||
| 2 | 5000 | [−4, 4] | 26/8/0 | 0/0 | 16 | 96/ |
| 0/0/0 | |
| 0/0/0 | 0/0 | 0 | 71/ |
| 0/0/0 | ||||
| 5 | 1000 | [−4, 6] | 59/90/68 | 44/84 | 99 |
| 42/71/78 | ||
| 1/18/20 | 16/84 | 55 | 96 | 21/53/65 | |||||
| 5 | 10000 | [−4, 6] | 56/86/51 | 4/69 |
| 99/99 | 99 | 44/65/68 | |
| 0/23/0 | 0/38 |
|
| 23/40/61 | |||||
| 6 | 1000 | [−5.5, 5.5] | 67/79/0 | 35/61 | 57 | 51/ |
| 39/1/0 | |
| 0/0/0 | 0/1 | 32 | 92/ | 97 | 0/0/0 | ||||
| 6 | 1000 | [−3.5, 3.5] | 84/ | 36/78 | 66 | 51/81 | 81 | 46/49/54 | |
| 0/13/0 | 0/4 | 37 | 92/ | 97 | 5/11/29 | ||||
| 6 | 10000 | [−3.5, 3.5] | 86/ | 1/59 | 71 | 98/ | 98 | 45/51/73 | |
| 0/28/0 | 0/0 | 4 | 64/ | 79 | 3/6/30 | ||||
| 7 | 500 | [−4, 4] | 86/99 |
|
| 62/1/39 | |||
| 18/75 | 73 | 96/42 | 95 | 17/1/1 | |||||
| 8 | 1000 | [−0.5, 8] | (0)/(0) | 94 | 98/95 | 98 | 1/0/0 | ||
| 33/37/0 | (0)/(0) | 30 | 82 | 0/0/0 | |||||
| 13 | 1000 | [−2, 2] | 80/ | 71/77 | 75 | 76/76 | 84 | 80 | 0/27/13 |
| 0/0/0 | 0/2 | 64 | 98/ | 32 | 87 | 0/13/2 | |||
| 13 | 10000 | [−2, 2] | 83/ | 44/57 |
|
|
| 0/15/2 | |
| 0/0/0 | 0/0 | 50 | 13 |
| 0/0/0 | ||||
| 14 | 1000 | [0, 1] | 83/99/49 | 60/76 | 92 | 97 | 98 | 0/12/0 | |
| 0/0/0 | 0/1 |
| 99/99 |
|
| 0/4/0 | |||
| 14 | 5000 | [0, 1] | 79/99/8 | 4/33 |
|
|
| 0/0/0 | |
| 0/0/0 | 0/0 | 99 | 98/ |
|
| 0/0/0 | |||
| 15 | 1000 | [0, 4] | 6/32/0 | 0/0 | 3 | 43/ | 70 | 0/0/0 | |
| 0/0/0 | 0/0 | 2 | 8/ | 32 | 0/0/0 | ||||
| 15 | 5000 | [0, 4] | 0/19/0 | 0/0 | 2 | 51/ | 93 | 2/0/0 | |
| 0/0/0 | 0/0 | 1 | 10/ | 65 | 0/0/0 |
Fig 1Test case 2: Example of density estimates obtained from the kernel density and diffusion estimators (left) as well as from the semiparametric and local likelihood estimators (right).
The SPDE is the model SPL2 with 2 knots after knot deletion. The sample size is N = 2000.
Fig 2Test case 6: Example of density estimates obtained from the kernel density and diffusion estimators (left) as well as from the semiparametric and local likelihood estimators (right).
The SPDE is the model SPL2 with 3 knots after knot deletion. The sample size is N = 2000.
Fig 3Test case 6: Left: Example of model selection, corresponding to the example shown in Fig 2.
Sample size is N = 2000. Right: Relative frequency out of 250 realisations of picking a particular model as the SPDE. The spline models encompass models without and with knot deletion.
Test case 14: Number of realisations out of 100 in which certain combinations of logarithmic boundary terms are chosen for the various semiparametric estimators and different sample sizes.
| Model | POL | SPL1 | SPL2 | TRIG | SPDE |
|---|---|---|---|---|---|
| Sample size | 1000/5000 | 1000/5000 | 1000/5000 | 1000/5000 | 1000/5000 |
| No boundary terms | 14/1 | 55/3 | 51/1 | 13/4 | 23/1 |
| Only left boundary term | 15/75 | 41/63 | 36/54 | 2/5 | 5/28 |
| Only right boundary term | 0/2 | 2/2 | 9/7 | 0/0 | 0/3 |
| Left+right boundary terms | 71/22 | 2/32 | 4/38 | 85/91 | 72/68 |
Test case 15: Number of realisations out of 100 in which no boundary term or a logarithmic boundary term is chosen for the various semiparametric estimators and different sample sizes.
| Model | POL | SPL1 | SPL2 | SPDE |
|---|---|---|---|---|
| Sample size | 1000/5000 | 1000/5000 | 1000/5000 | 1000/5000 |
| No boundary term | 88/49 | 53/13 | 55/14 | 52/11 |
| Left boundary term | 12/51 | 47/87 | 45/86 | 48/89 |
Effect of boundary terms on model accuracy: Mean integrated squared error (scaled by 104) estimated as an average over 100 realisations.
The upper row refers to models without boundary terms, the lower row to models allowing for logarithmic boundary terms at any boundary point.
| Test case |
| POL | SPL1/2 | TRIG | SPDE |
|---|---|---|---|---|---|
| 14 | 1000 | 105 | 101/104 | 96 | 93 |
| 80 | 74/78 | 63 | 70 | ||
| 14 | 5000 | 27 | 37/38 | 34 | 36 |
| 22 | 21/17 | 27 | 26 | ||
| 15 | 1000 | 92 | 65/63 | 62 | |
| 95 | 61/62 | 58 | |||
| 15 | 5000 | 68 | 25/19 | 18 | |
| 69 | 25/17 | 16 |
Boundary bias (scaled by 104) estimated as the mean over 100 realisations.
The true density at the boundary is f(−2) = 0.0427 and f(2) = 0.1450 for test case 13, f(0) = f(1) = 0 for test case 14, and f(0) = 0 for test case 15. For the semiparametric estimators and test cases 14 and 15, the upper row refers to models without boundary terms and the lower row to models allowing for logarithmic boundary terms at any boundary point. The best method is indicated in bold.
| Test case |
|
| KDE1/2/3 | DE0/1 | POL | SPL1/2 | TRIG | SPDE | LLDE0/1/2 |
|---|---|---|---|---|---|---|---|---|---|
| 13 | 1000 | -2 | 171/289/124 | 145/103 | 144 | -122 | -15 | 6/75/-51 | |
| 13 | 1000 | 2 | 1081/1438/837 | 850/898 | 221 | 225/51 | 203 | 60 | |
| 14 | 1000 | 0 | 7990/9958/6621 | 6508/6909 | 4115 | 4085/3969 | 3734 | 3652 | 2784/2195/917 |
| 541 | 2110/2182 |
| 772 | ||||||
| 14 | 1000 | 1 | 357/957/186 | 365/185 | 55 | 106/109 | 70 | 107 | 11/21/ |
| 30 | 96/100 | 13 | 34 | ||||||
| 14 | 5000 | 0 | 6410/8444/4380 | 4293/4487 | 2468 | 2989/2849 | 2803 | 2765 | 2006/1153/594 |
|
| 139/228 | 122 | 95 | ||||||
| 14 | 5000 | 1 | 162/462/44 | 105/43 | 22 | 36/80 | 46 | 40 | 5/ |
| 40 | 49/55 | 10 | 23 | ||||||
| 15 | 1000 | 0 | 2879/3496/1740 | 1407/1826 | 483 | 981/806 | 897 | 1306/905/422 | |
|
| 517/496 | 491 | |||||||
| 15 | 5000 | 0 | 2308/2919/1206 | 926/1257 | 533 | 741/549 | 576 | 1112/752/347 | |
| 177 | 99/83 |
|
Fig 4Test case 15: Example of density estimates from selected methods (left) and corresponding model selection for the SPDE (right).
Only the best spline model after each knot deletion step is indicated to increase the readability of the plot. The SPDE is the model SPL2 with 4 knots after knot deletion augmented with a logarithmic boundary term. The sample size is N = 2000.
Old Faithful geyser data: Bayesian information criterion, cross-validated log-likelihood, number of modes and number of bumps for various density estimators.
The number of modes and bumps is counted on the interval [1.25, 5.5] for the eruption duration data and on the interval [35, 100] for the waiting time data. The best method is indicated in bold.
| Duration | Waiting time | |||||||
|---|---|---|---|---|---|---|---|---|
| BIC | cvlogl | Modes | Bumps | BIC | cvlogl | Modes | Bumps | |
| KDE1 | -302.0 | 2 | 2 | -1044.4 | 2 | 2 | ||
| KDE2 | -300.8 | 2 | 2 | -1045.0 | 2 | 3 | ||
| KDE3 | -276.9 | 2 | 2 | -1049.3 | 2 | 2 | ||
| DE0 | -275.8 | 2 | 2 | -1047.2 | 2 | 2 | ||
| DE1 | -275.1 | 2 | 2 | -1044.3 | 2 | 2 | ||
| GMM | 572.7 | -274.2 | 3 | 3 | 2096.0 | -1038.9 | 2 | 2 |
| POL |
|
| 2 | 2 | 2097.9 | -1039.9 | 2 | 2 |
| SPL1 | 548.2 | -266.4 | 2 | 2 | 2093.5 | -1039.6 | 2 | 2 |
| SPL2 | 546.9 | -265.8 | 2 | 2 |
| -1039.5 | 2 | 2 |
| LLDE0 | -270.9 | 3 | 9 | -1040.1 | 2 | 5 | ||
| LLDE1 | -268.5 | 3 | 7 | -1039.3 | 2 | 2 | ||
| LLDE2 | -267.2 | 2 | 2 |
| 2 | 2 | ||
Fig 5Old Faithful Geyser eruption data: Density estimates of the eruption duration (left) and waiting time (right) from various methods.
For the duration data the SPDE is the model with potential function U(y) = α1log y + α2(1/y) + α3 y + α4 y2; for the waiting time data it is the model SPL2 with 1 knot after knot deletion.