Dake Yang, Jethro Johnson, Xin Zhou, Elena Deych, Berkley Shands, Blake Hanson, Erica Sodergren, George Weinstock, William D Shannon.
Abstract
Regressing an outcome (dependent variable) onto a set of inputs (independent variables) lets the analyst measure associations between the two, so that changes in the outcome can be described and predicted by changes in the inputs. Classical statistics offers many ways to do this when the dependent variable has certain properties (e.g., a scalar, a survival time, a count), but little progress has been made on regression methods for microbiome taxa counts that do not impose extremely strict conditions on the data. In this paper, we propose and apply a new regression model that combines the Dirichlet-multinomial distribution with recursive partitioning, providing a fully non-parametric regression model. This model, called DM-RPart, is applied to cytokine data and microbiome taxa count data; it is applicable to any microbiome taxa count/metadata, is fit automatically, and is intuitively interpretable. The model can be applied to any microbiome or other compositional data, and software (R package HMP) is available through the R CRAN website.
Year: 2019 PMID: 31882682 PMCID: PMC6934614 DOI: 10.1038/s41598-019-56397-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Results of applying DM-RPart to the iHMP microbiome and cytokine data. (A, top) The optimal tree, found by fitting the full tree and then pruning it by cross-validation. (B, bottom) The taxa frequency composition differences for the 5 terminal nodes.
The complexity table of recursive partitioning on full iHMP data.
| Tree Complexity | Cost Complexity (α) | Number of Splits | Relative Error | Cross-validation Error |
|---|---|---|---|---|
| 1 | 0.0083 | 0 | 1.000 | 0.10704 |
| 2 | 0.0076 | 1 | 0.992 | 0.10893 |
| 3 | 0.0073 | 2 | 0.984 | 0.10792 |
| 5 | 0.0024 | 4 | 0.970 | 0.10508 |
| 6 | 0.0021 | 5 | 0.967 | 0.10827 |
| 8 | 0.0016 | 7 | 0.963 | 0.10903 |
| 9 | 0.0013 | 8 | 0.961 | 0.10948 |
| 10 | 0.0012 | 9 | 0.960 | 0.10928 |
| 11 | 0.0006 | 10 | 0.959 | 0.10907 |
| 12 | 0.0000 | 11 | 0.958 | 0.10932 |
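As a minimal sketch of how such a complexity table can be read, the snippet below picks the row with the smallest cross-validation error. This selection rule is an assumption: the source only says the tree is pruned by cross-validation and does not state the exact criterion. The values are copied from the table above; the names `complexity_table` and `best_row` are illustrative.

```python
# Rows copied from the iHMP complexity table:
# (tree complexity, cost complexity alpha, number of splits,
#  relative error, cross-validation error)
complexity_table = [
    (1, 0.0083, 0, 1.000, 0.10704),
    (2, 0.0076, 1, 0.992, 0.10893),
    (3, 0.0073, 2, 0.984, 0.10792),
    (5, 0.0024, 4, 0.970, 0.10508),
    (6, 0.0021, 5, 0.967, 0.10827),
    (8, 0.0016, 7, 0.963, 0.10903),
    (9, 0.0013, 8, 0.961, 0.10948),
    (10, 0.0012, 9, 0.960, 0.10928),
    (11, 0.0006, 10, 0.959, 0.10907),
    (12, 0.0000, 11, 0.958, 0.10932),
]


def best_row(table):
    """Return the row with the smallest cross-validation error (assumed rule)."""
    return min(table, key=lambda row: row[4])


best = best_row(complexity_table)
print(f"tree complexity={best[0]}, splits={best[2]}, cv error={best[4]}")
```

Under this rule the selected row has 4 splits. A binary tree with 4 splits has 5 terminal nodes, which agrees with the 5 terminal nodes reported in the Figure 1 caption.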
Average cross-validation measure of the best tree over 100 iterations. Each cell shows the average for one combination of the simulation parameters.
The complexity table of recursive partitioning on insulin sensitivity data.
| Tree Complexity | Cost Complexity (α) | Number of Splits | Relative Error | Cross-validation Error |
|---|---|---|---|---|
| 1 | 0.01113 | 0 | 1.0000 | 0.08750 |
| 2 | 0.00622 | 1 | 0.9889 | 0.09051 |
| 3 | 0.00583 | 2 | 0.9826 | 0.09077 |
| 5 | 0.00551 | 4 | 0.9710 | 0.09060 |
| 6 | 0.00353 | 5 | 0.9655 | 0.09045 |
| 7 | 0.00291 | 6 | 0.9620 | 0.09040 |
| 8 | 0.00223 | 7 | 0.9591 | 0.09051 |
| 9 | 0.00081 | 8 | 0.9568 | 0.09138 |
| 10 | 0.00048 | 9 | 0.9560 | 0.09196 |
| 11 | 0.00001 | 10 | 0.9555 | 0.09227 |
| 12 | 0.00000 | 11 | 0.9555 | 0.09230 |
Average misclassification error rate of the best tree over 100 iterations. Each cell is expressed as a percentage.
Summary of parameters in simulation studies.
| Parameters | Value |
|---|---|
| Sample size per body site | 40, 80, 120 |
| Expected taxa abundances per body site | |
| Dispersion parameter | 0.08, 0.2, 0.6 |
| Mean of covariate | Mean of G1 = −1, G2 = 0, G3 = 1 |
| Standard deviation of covariate | 0.2, 0.35, 0.5 |
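One way such a simulation could be set up is sketched below, using one setting from the table (40 samples per group, dispersion 0.2, covariate SD 0.2, covariate means of -1, 0, and 1). Several parts are assumptions not given in the source: the taxa abundances in `group_pi` are hypothetical because the table leaves them blank, the 1000 reads per sample is hypothetical, and the Dirichlet shape is assumed to be `pi * (1 - theta) / theta`. This is not the authors' simulation code.

```python
import random
from collections import Counter


def dm_counts(rng, pi, theta, reads):
    """Draw one Dirichlet-multinomial count vector with mean proportions pi."""
    # Assumed parameterization: Dirichlet shape = pi * (1 - theta) / theta,
    # so a larger theta gives more sample-to-sample variability.
    shape = [p * (1.0 - theta) / theta for p in pi]
    gammas = [rng.gammavariate(a, 1.0) for a in shape]
    total = sum(gammas)
    probs = [g / total for g in gammas]  # normalized gammas form a Dirichlet draw
    tally = Counter(rng.choices(range(len(pi)), weights=probs, k=reads))
    return [tally.get(j, 0) for j in range(len(pi))]


def simulate(rng, group_pi, group_means, n_per_group, theta, cov_sd, reads):
    """Return (group label, covariate, taxa counts) for every simulated sample."""
    samples = []
    for label, (pi, mean) in enumerate(zip(group_pi, group_means), start=1):
        for _ in range(n_per_group):
            covariate = rng.gauss(mean, cov_sd)
            samples.append((label, covariate, dm_counts(rng, pi, theta, reads)))
    return samples


rng = random.Random(0)
# Hypothetical expected taxa abundances for G1, G2, G3 (not given in the source).
group_pi = [
    [0.5, 0.3, 0.1, 0.1],
    [0.3, 0.4, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]
samples = simulate(rng, group_pi, [-1.0, 0.0, 1.0],
                   n_per_group=40, theta=0.2, cov_sd=0.2, reads=1000)
print(len(samples), samples[0])
```

Each simulated sample pairs a covariate value with a vector of taxa counts, which is the input shape a tree-based regression of counts on covariates would need.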
Average misclassification error rate on the validation data over 100 iterations. Each cell is expressed as a percentage.
Average mean squared error (×10⁻³) of the best tree.
| Total sample size | SD = 0.2, θ = 0.08 | SD = 0.2, θ = 0.2 | SD = 0.2, θ = 0.6 | SD = 0.35, θ = 0.08 | SD = 0.35, θ = 0.2 | SD = 0.35, θ = 0.6 | SD = 0.5, θ = 0.08 | SD = 0.5, θ = 0.2 | SD = 0.5, θ = 0.6 |
|---|---|---|---|---|---|---|---|---|---|
| 120 | 1.24 | 3.18 | 11.23 | 1.32 | 3.22 | 12.64 | 1.59 | 3.62 | 14.26 |
| 240 | 1.08 | 2.98 | 9.65 | 1.12 | 3.1 | 10.52 | 1.35 | 3.33 | 11.91 |
| 360 | 0.97 | 2.69 | 9.23 | 1 | 2.93 | 9.78 | 1.25 | 3.25 | 10.23 |
The complexity table of Gibbs-RPart on Parkinson’s disease data. The red row (row 4) indicates the best size tree with 6 terminal nodes.
Figure 2. Simulated data to illustrate classical recursive partitioning.
Figure 3. Optimal recursive partitioning tree fit to the simulated data in Fig. 2.
Figure 4. Unpruned full recursive partitioning tree fit to the simulated data in Fig. 2.
The complexity table of recursive partitioning on simulated data.
| Tree Complexity | Cost Complexity (α) | Number of Splits | Relative Error | Cross-validation Error |
|---|---|---|---|---|
| | 0.314 | 0 | 1.000 | 1.000 |
| | 0.043 | 2 | 0.371 | 0.429 |
| | 0.007 | 4 | 0.286 | 0.514 |
| | 0.000 | 8 | 0.257 | 0.743 |
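For orientation, the cost-complexity parameter α in tables like this one is usually tied to the standard cost-complexity pruning criterion for classification and regression trees. The source does not state its exact definition or scaling, so the expression below is the textbook form and should be read as an assumption. Here $R(T)$ is the error of tree $T$ and $|\tilde{T}|$ is its number of terminal nodes.

```latex
R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|
```

Under this form, a larger α penalizes additional terminal nodes more heavily and favors smaller trees, while α = 0 leaves the full tree unpenalized. That is consistent with the table, where α = 0.000 corresponds to the largest tree (8 splits) and the lowest relative error, although the cross-validation error is lowest at 2 splits.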
Pairwise p-values between terminal nodes. The diagonal entries are one because each terminal node is compared with itself.