Abstract
Our goal is to estimate the true number of classes in a population, called the species richness. We consider the case where multiple frequency count tables have been collected from a homogeneous population, and investigate a penalized maximum likelihood estimator under a negative binomial model. Because high probabilities of unobserved classes increase the variance of species richness estimates, our method penalizes the probability of a class being unobserved. Tuning the penalization parameter is challenging because the true species richness is never known, and so we propose and validate four novel methods for tuning the penalization parameter. We illustrate and contrast the performance of the proposed methods by estimating the strain-level microbial diversity of Lake Champlain over 3 consecutive years, and global human host-associated species-level microbial richness.
Keywords: diversity; ecology; maximum likelihood; microbiome; regularization
Year: 2020 PMID: 33967371 PMCID: PMC8098713 DOI: 10.1080/02664763.2020.1754359
Source DB: PubMed Journal: J Appl Stat ISSN: 0266-4763 Impact factor: 1.404
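The setup described in the abstract can be made concrete with a minimal sketch, which is not the authors' implementation. It assumes the negative binomial model arises as a gamma-mixed Poisson with hypothetical parameters `shape` and `rate`, builds a frequency count table from a list of abundances, and uses the standard identity E[c] = C(1 - p0) to convert an observed class count c into a richness estimate. All function names and the example abundances are illustrative assumptions.

```python
from collections import Counter


def frequency_count_table(abundances):
    """Map k -> number of classes observed exactly k times (k >= 1)."""
    counts = Counter(a for a in abundances if a > 0)
    return dict(sorted(counts.items()))


def prob_unobserved(shape, rate):
    """P(X = 0) when X | lam ~ Poisson(lam) and lam ~ Gamma(shape, rate).

    The (shape, rate) parameterization is an assumption of this sketch.
    """
    return (rate / (1.0 + rate)) ** shape


def richness_from_p0(observed_classes, p0):
    """Solve E[c] = C * (1 - p0) for C, given the observed class count c."""
    return observed_classes / (1.0 - p0)


# Made-up abundances: zeros are classes that exist but were never seen.
abundances = [0, 0, 1, 1, 1, 2, 5, 0, 3, 1]
table = frequency_count_table(abundances)
c_obs = sum(table.values())

# Two hypothetical parameter settings, the second with a much larger p0.
for shape, rate in [(1.0, 1.0), (0.1, 1.0)]:
    p0 = prob_unobserved(shape, rate)
    print(shape, rate, round(p0, 3), round(richness_from_p0(c_obs, p0), 1))
```

As p0 approaches 1 the divisor 1 - p0 shrinks, so small changes in the estimated p0 produce large changes in the richness estimate. This is the instability that motivates penalizing the probability of a class being unobserved.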
The expected proportion of unobserved (k = 0), singleton (k = 1), rare (k = 1, 2, 3), and abundant ( ) species for 4 choices of η.
| ( | ( | ( | ( | ( | |
|---|---|---|---|---|---|
| Proportion unobserved (k = 0) | 0.787 | 0.501 | 0.316 | 0.891 | 0.933 |
| Proportion singletons (k = 1) | 0.072 | 0.050 | 0.032 | 0.009 | 0.009 |
| Proportion rare (k = 1, 2, 3) | 0.130 | 0.097 | 0.061 | 0.016 | 0.017 |
| Proportion abundant | 0.028 | 0.340 | 0.583 | 0.083 | 0.040 |
| Expected max abundance |
We also give the expected frequency count of the most abundant species when .
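Proportions of the kind tabulated above follow directly from the count distribution's probability mass function. The sketch below is a hedged illustration, assuming a gamma-mixed Poisson (negative binomial) with hypothetical `shape` and `rate` parameters. The η values used in the table are not recoverable here, so the numbers it prints are not expected to match the table. The "abundant" category is omitted because its threshold is not stated in the text above.

```python
import math


def nb_pmf(k, shape, rate):
    """P(X = k) for a Poisson count whose mean is Gamma(shape, rate).

    Computed on the log scale for numerical stability.
    """
    log_p = (
        math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
        + shape * math.log(rate / (1.0 + rate))
        - k * math.log(1.0 + rate)
    )
    return math.exp(log_p)


def expected_proportions(shape, rate, rare_max=3):
    """Expected share of classes that are unobserved, singletons, or rare."""
    return {
        "unobserved (k = 0)": nb_pmf(0, shape, rate),
        "singletons (k = 1)": nb_pmf(1, shape, rate),
        "rare (k = 1..%d)" % rare_max: sum(
            nb_pmf(k, shape, rate) for k in range(1, rare_max + 1)
        ),
    }


# Hypothetical parameter values chosen only for illustration.
for name, value in expected_proportions(shape=1.0, rate=1.0).items():
    print(name, round(value, 4))
```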
Figure 1. Estimates of C and their root-MSE over λ when and C = 1000. Results are based on 100 simulations per λ.
Figure 2. Estimates of C and their root-MSE over λ when and C = 1000. Results are based on 100 simulations per λ.
The penalized maximum likelihood estimate of C has lower RMSE than the unpenalized estimate for all investigated choices of η and r under a zero-truncated gamma-mixed Poisson model for species abundances. Results are based on 100 simulations for each choice of η and r.
| r | RMSE (λ = 0) | Best λ | RMSE (best λ) | |
|---|---|---|---|---|
| 6 | 796.90 | 10 | 326.50 | |
| 10 | 527.53 | 15 | 286.27 | |
| 14 | 337.31 | 10 | 235.51 | |
| 6 | 735.95 | 15 | 516.37 | |
| 10 | 700.21 | 20 | 456.04 | |
| 14 | 666.55 | 20 | 470.35 | |
| 6 | 200.09 | 20 | 156.29 | |
| 10 | 213.22 | 55 | 142.64 | |
| 14 | 243.42 | 55 | 148.26 | |
| 6 | 401.17 | 55 | 147.13 | |
| 10 | 283.91 | 25 | 137.04 | |
| 14 | 415.19 | 70 | 126.72 |
is the value of λ which produced the lowest RMSE. is the estimate of C when , and is the estimate of C when .
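The RMSE values reported here summarize how far simulated estimates fall from the true C. As a minimal sketch (with made-up estimates, not the paper's results), the code below computes RMSE and splits the mean squared error into squared bias plus variance. The split shows how a penalty that reduces variance can lower RMSE even if it introduces some bias.

```python
import math


def rmse(estimates, true_value):
    """Root mean squared error of estimates around the true value."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse)


def bias_and_variance(estimates, true_value):
    """Return (bias, variance) with the 1/n variance, so MSE = bias^2 + var."""
    mean = sum(estimates) / len(estimates)
    variance = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean - true_value, variance


# Hypothetical richness estimates from four simulated data sets, C = 1000.
estimates = [900.0, 1100.0, 1000.0, 1300.0]
true_c = 1000.0

bias, variance = bias_and_variance(estimates, true_c)
print("RMSE:", rmse(estimates, true_c))
print("sqrt(bias^2 + variance):", math.sqrt(bias ** 2 + variance))
```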
RMSE for Methods 0–4 based on a zero-truncated gamma-mixed Poisson data generating process.
| Method 0: MLE (no penalization) | 709 | 787 | 775 | 716 | ||
| Method 1: Minimum subset variance | 630 | 578 | 763 | 821 | 796 | |
| Method 2: Cross-validated likelihood | 797 | 521 | 492 | 617 | ||
| Method 3: Goodness of fit | 707 | 781 | 663 | |||
| Method 4: Cross-validated g.o.f. | 812 | 571 | 533 | 738 | 787 | 679 |
C = 1000 in all simulations, and 100 simulations were run for each combination. Methods with lower RMSE than Method 0 are highlighted in grey, and the best method for each combination is in bold.
Figure 3. Simulation results for all proposed methods when , and when .
RMSE for Methods 0 and 3 when counts are drawn from a gamma-mixed Poisson distribution with parameter η.
| Method | |||||||
|---|---|---|---|---|---|---|---|
| Method 0: MLE (no penalization) | 500 | 287 | 244 | 216 | 133 | 73 | |
| Method 3: Goodness of fit | 500 | 312 | 249 | 248 | 139 | 83 | |
| Method 0: MLE (no penalization) | 1000 | 418 | 266 | 277 | 158 | 127 | |
| Method 3: Goodness of fit | 1000 | 427 | 268 | 273 | 187 | 153 | |
| Method 0: MLE (no penalization) | 2000 | 463 | 419 | 372 | 367 | 346 | |
| Method 3: Goodness of fit | 2000 | 513 | 439 | 338 | 413 | 386 | |
| Method 0: MLE (no penalization) | 500 | 266 | 230 | 211 | 208 | 173 | |
| Method 3: Goodness of fit | 500 | 301 | 241 | 229 | 230 | 192 | |
| Method 0: MLE (no penalization) | 1000 | 485 | 430 | 429 | 365 | 372 | |
| Method 3: Goodness of fit | 1000 | 498 | 446 | 445 | 393 | 455 | |
| Method 0: MLE (no penalization) | 2000 | 908 | 775 | 781 | 719 | 787 | |
| Method 3: Goodness of fit | 2000 | 999 | 863 | 833 | 906 | 945 | |
| Method 0: MLE (no penalization) | 500 | 314 | 207 | 305 | 295 | 84 | |
| Method 3: Goodness of fit | 500 | 41 | 89 | 9 | 8 | 7 | |
| Method 0: MLE (no penalization) | 1000 | 701 | 438 | 500 | 375 | 227 | |
| Method 3: Goodness of fit | 1000 | 35 | 218 | 387 | 246 | 16 | |
| Method 0: MLE (no penalization) | 2000 | 1648 | 1143 | 766 | 743 | 998 | |
| Method 3: Goodness of fit | 2000 | 796 | 690 | 542 | 538 | 74 |
Results are based on 100 draws.
Figure 4. Simulation results for Methods 0 and 3 when , and . The distribution of is shown over 100 draws. The true value of C is indicated with a solid horizontal line.
RMSE for Methods 0 and 3 when counts are drawn according to a zero-inflated gamma-mixed Poisson distribution with C = 1000.
| Method | |||||||
|---|---|---|---|---|---|---|---|
| Method 0: MLE (no penalization) | 0.1 | 246 | 315 | 146 | 237 | 142 | |
| Method 3: Goodness of fit | 0.1 | 293 | 327 | 163 | 247 | 163 | |
| Method 0: MLE (no penalization) | 0.2 | 288 | 203 | 209 | 194 | 227 | |
| Method 3: Goodness of fit | 0.2 | 331 | 239 | 211 | 209 | 218 | |
| Method 0: MLE (no penalization) | 0.3 | 393 | 375 | 342 | 290 | 298 | |
| Method 3: Goodness of fit | 0.3 | 439 | 397 | 334 | 291 | 312 | |
| Method 0: MLE (no penalization) | 0.1 | 292 | 466 | 224 | 171 | 299 | |
| Method 3: Goodness of fit | 0.1 | 130 | 121 | 145 | 195 | 216 | |
| Method 0: MLE (no penalization) | 0.2 | 566 | 320 | 250 | 286 | 222 | |
| Method 3: Goodness of fit | 0.2 | 244 | 241 | 233 | 204 | 199 | |
| Method 0: MLE (no penalization) | 0.3 | 446 | 422 | 344 | 297 | 299 | |
| Method 3: Goodness of fit | 0.3 | 320 | 307 | 323 | 319 | 291 |
Results are based on 100 draws from the distribution where is a gamma-mixed Poisson distribution with parameters η.
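For readers unfamiliar with zero inflation, the following sketch shows one common way to write such a distribution: a point mass at zero mixed with a base count distribution. The exact expression is missing from the text above, so the mixture form is an assumption. The name `zero_prob` and the `shape` and `rate` parameters of the gamma-mixed Poisson are also assumptions, and the values are illustrative only.

```python
import math


def nb_pmf(k, shape, rate):
    """P(X = k) for a Poisson count whose mean is Gamma(shape, rate)."""
    log_p = (
        math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
        + shape * math.log(rate / (1.0 + rate))
        - k * math.log(1.0 + rate)
    )
    return math.exp(log_p)


def zero_inflated_pmf(k, zero_prob, shape, rate):
    """Mixture: with probability zero_prob the count is a structural zero,
    otherwise it is drawn from the gamma-mixed Poisson base distribution."""
    base = (1.0 - zero_prob) * nb_pmf(k, shape, rate)
    return zero_prob + base if k == 0 else base


# Illustrative values: 20% structural zeros on top of the base distribution.
for k in range(4):
    print(k, round(zero_inflated_pmf(k, zero_prob=0.2, shape=1.0, rate=1.0), 4))
```

Under this form the probability of an unobserved class is higher than the base distribution alone would imply, because the structural zeros add to the zeros that arise by chance.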
Diversity estimates from the Lake Champlain data analysis from 2009 (r = 8), 2010 (r = 6) and 2011 (r = 6) using our proposed methods.
| 2009 | 2010 | 2011 | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | ||||||||||||
| [0] Unpenalized MLE | 73,404 | — | 0.00088 | 0.00180 | 47,631 | — | 0.00185 | 0.00253 | 57,686 | — | 0.00161 | 0.00140 |
| [3] Goodness of fit | 20,160 | 550 | 0.00323 | 0.00174 | 13,156 | 225 | 0.00685 | 0.00257 | 40,040 | 230 | 0.00231 | 0.00137 |