| Literature DB >> 34474689 |
Wodan Ling1, Ni Zhao2, Anna M Plantinga3, Lenore J Launer4, Anthony A Fodor5, Katie A Meyer6, Michael C Wu7.
Abstract
BACKGROUND: Identification of bacterial taxa associated with diseases, exposures, and other variables of interest offers a more comprehensive understanding of the role of microbes in many conditions. However, despite considerable research in statistical methods for association testing with microbiome data, approaches that are generally applicable remain elusive. Classical tests often do not accommodate the realities of microbiome data, leading to power loss. Approaches tailored for microbiome data depend highly upon the normalization strategies used to handle differential read depth and other data characteristics, and they often have unacceptably high false positive rates, generally due to unsatisfied distributional assumptions. On the other hand, many non-parametric tests suffer from loss of power and may also present difficulties in adjusting for potential covariates. Most extant approaches also fail in the presence of heterogeneous effects. The field needs new non-parametric approaches that are tailored to microbiome data, robust to distributional assumptions, and powerful under heterogeneous effects, while permitting adjustment for covariates.Entities:
Keywords: Heterogeneity; Microbiome differential abundance analysis; Type I error control; Zero-inflated quantile-based approach
Mesh:
Year: 2021 PMID: 34474689 PMCID: PMC8414689 DOI: 10.1186/s40168-021-01129-3
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Fig. 1Graphical illustration of the step-wise implementation of ZINQ. Step 1: Test of γ=0 by any valid test of logistic regression tells whether the variable of interest is associated with the presence-absence status of the taxon in samples. Step 2: Test of β(τ)=0 by the novel quantile rank-score test adjusting for zero inflation tells whether the variable of interest is associated with the difference at the τth percentile of the taxon’s non-zero measurements. The testing is conducted marginally on K selected quantiles of the non-zero abundance, such as the default grid. Step 3: Combine the marginal p-values in Steps 1 and 2 considering the dependence structure of the tests. Only when the aggregate p-value is significant, we conclude that the taxon is differentially abundant
Summary statistics of three important covariates of CARDIA in groups with / without HBP
| Without HBP | With HBP | ||
|---|---|---|---|
| 346 | 185 | ||
| Age | 55.12 (3.44) | 55.78 (3.39) | 0.034 |
| Physical activity score | 393.76 (307.74) | 263.84 (241.39) | < 0.001 |
| Dietary quality score | 3.43 (5.84) | 1.29 (5.39) | < 0.001 |
p-values are calculated by 2-sample t test.
Stratified count, mean and standard deviation of age, physical activity score and dietary quality score in groups with/without HBP
Type I error control and power on simulated data based on Anaerovorax’s normalized abundance
| Sample size = 600 | |||||
|---|---|---|---|---|---|
| Type I error | Power | ||||
| Null | Setting 1 | Setting 2 | Setting 3 | ||
| 0.05 | 0.01 | 0.05 | 0.05 | 0.05 | |
| Linear regression | 0.0547 | 0.0084 | 0.9949 | 0.6928 | 0.1247 |
| ZIP | 0.7387 | 0.6622 | 1.0000 ∗ | 0.9742 ∗ | 0.7720 ∗ |
| ZINB | 0.2019 | 0.0771 | 0.9812 ∗ | 0.7321 ∗ | 0.2398 ∗ |
| ZINQ-MinP | 0.0526 | 0.0106 | 0.9994 | 0.8557 | 0.1508 |
| ZINQ-Cauchy | 0.0580 | 0.0110 | 0.9991 | 0.8346 | 0.1493 |
| Linear regression | 0.0536 | 0.0088 | 0.9970 | 0.7425 | 0.1320 |
| ZIB | 0.0110 | 0.0017 | 0.9964 + | 0.6255 + | 0.0305 + |
| Tobit | 0.0543 | 0.0099 | 0.9989 | 0.8041 | 0.1467 |
| ZIlogN | 0.6992 | 0.6872 | 1.0000 ∗ | 1.0000 ∗ | 0.9999 ∗ |
| ZIG | 0.0548 | 0.0102 | 0.9961 | 0.7264 | 0.1196 |
| ZINQ-MinP | 0.0501 | 0.0101 | 0.9995 | 0.9096 | 0.1669 |
| ZINQ-Cauchy | 0.0503 | 0.0103 | 0.9994 | 0.8981 | 0.1555 |
| Linear regression | 0.0527 | 0.0113 | 0.9995 | 0.8934 | 0.1733 |
| Tobit | 0.0526 | 0.0110 | 0.9985 | 0.8597 | 0.1628 |
| ZIlogN | 0.0475 | 0.0095 | 0.9996 | 0.8794 | 0.1464 |
| ZIG | 0.0494 | 0.0096 | 0.9998 | 0.8850 | 0.1474 |
| ZINQ-MinP | 0.0501 | 0.0103 | 0.9993 | 0.8852 | 0.1520 |
| ZINQ-Cauchy | 0.0505 | 0.0095 | 0.9991 | 0.8735 | 0.1524 |
Setting 1: 100% from HBP edf for HBP samples;
Setting 2: 80% from HBP edf and 20% from non-HBP edf for HBP samples;
Setting 3: 60% from HBP edf and 40% from non-HBP edf for HBP samples.
∗: power of a method that inflates type I error
+: power of a method that deflates type I error
Results by the various methods on 10000 simulated datasets by generating samples from the edf of Anaerovorax’s normalized abundance, including type I error control and power under different settings with significance cutoffs 0.05 and 0.01
Type I error control and power on simulated data based on Saccharibacteria’s normalized abundance
| Sample size = 600 | |||||
|---|---|---|---|---|---|
| Type I error | Power | ||||
| Null | Setting 1 | Setting 2 | Setting 3 | ||
| 0.05 | 0.01 | 0.05 | 0.05 | 0.05 | |
| Linear regression | 0.0247 | 0.0032 | 0.0602 + | 0.0424 + | 0.0326 + |
| ZIP | 0.8238 | 0.7766 | 0.8056 ∗ | 0.7642 ∗ | 0.7384 ∗ |
| ZINB | 0.4241 | 0.2916 | 0.3515 ∗ | 0.3304 ∗ | 0.3150 ∗ |
| ZINQ-MinP | 0.0471 | 0.0089 | 0.9243 | 0.4867 | 0.0819 |
| ZINQ-Cauchy | 0.0506 | 0.0100 | 0.9166 | 0.5428 | 0.0954 |
| Linear regression | 0.0279 | 0.0030 | 0.0372 + | 0.0338 + | 0.0320 + |
| ZIB | 0.0053 | 0.0009 | 0.0649 + | 0.0190 + | 0.0067 + |
| Tobit | 0.0522 | 0.0137 | 0.0837 | 0.0703 | 0.0635 |
| ZIlogN | 0.9997 | 0.9997 | 0.9987 ∗ | 0.9978 ∗ | 0.9983 ∗ |
| ZIG | 0.0495 | 0.0073 | 0.1094 | 0.0675 | 0.0498 |
| ZINQ-MinP | 0.0428 | 0.0083 | 0.6626 | 0.2480 | 0.0596 |
| ZINQ-Cauchy | 0.0497 | 0.0099 | 0.6800 | 0.2818 | 0.0700 |
| Linear regression | 0.0500 | 0.0107 | 0.2021 | 0.1034 | 0.0541 |
| Tobit | 0.0498 | 0.0111 | 0.1677 | 0.0929 | 0.0533 |
| ZIlogN | 0.0446 | 0.0071 | 0.4933 | 0.2063 | 0.0621 |
| ZIG | 0.0443 | 0.0076 | 0.5563 | 0.2264 | 0.0643 |
| ZINQ-MinP | 0.0456 | 0.0085 | 0.8442 | 0.3766 | 0.0720 |
| ZINQ-Cauchy | 0.0497 | 0.0099 | 0.8327 | 0.3897 | 0.0773 |
Setting 1: 100% from HBP edf for HBP samples;
Setting 2: 80% from HBP edf and 20% from non-HBP edf for HBP samples;
Setting 3: 60% from HBP edf and 40% from non-HBP edf for HBP samples.
∗: power of a method that inflates type I error
+: power of a method that deflates type I error
Results by the various methods on 10000 simulated datasets by generating samples from the edf of Saccharibacteria’s normalized abundance, including type I error control and power under different settings with significance cutoffs 0.05 and 0.01
Average FPR and TPR by adjusted analysis on un-normalized/normalized simulated OTU tables
| Sample size = 531 | ||||||||
|---|---|---|---|---|---|---|---|---|
| 0.05 | 0.01 | 0.05 | 0.01 | 0.05 | 0.01 | 0.05 | 0.01 | |
| FPR | ||||||||
| Corncob | 0.1077 | 0.0498 | 0.0919 | 0.0400 | - | - | - | - |
| DESeq2 | 0.0921 | 0.0395 | 0.0779 | 0.0312 | - | - | - | - |
| EdgeR | 0.1034 | 0.0415 | 0.0893 | 0.0331 | - | - | - | - |
| LDM | 0.0501 | 0.0096 | 0.0499 | 0.0096 | 0.0501 | 0.0096 | 0.0489 | 0.0095 |
| Limma | 0.0561 | 0.0128 | 0.0719 | 0.0203 | - | - | - | - |
| Linear regression | 0.0475 | 0.0085 | 0.0469 | 0.0083 | 0.0472 | 0.0083 | 0.0488 | 0.0098 |
| MetagenomeSeq | - | - | - | - | - | - | 0.1539 | 0.0759 |
| Monocle | 0.7261 | 0.6695 | 0.6493 | 0.5839 | 0.0486 | 0.0086 | 0.0501 | 0.0102 |
| QRank | 0.0493 | 0.0101 | 0.0499 | 0.0100 | 0.0503 | 0.0098 | 0.0496 | 0.0099 |
| ZINQ-MinP | 0.0483 | 0.0094 | 0.0484 | 0.0096 | 0.0488 | 0.0096 | 0.0472 | 0.0091 |
| ZINQ-Cauchy | 0.0533 | 0.0107 | 0.0535 | 0.0106 | 0.0539 | 0.0104 | 0.0530 | 0.0109 |
| TPR | ||||||||
| Corncob | 0.4544 ∗ | 0.3093 ∗ | 0.4018 ∗ | 0.2615 ∗ | - | - | - | - |
| DESeq2 | 0.3289 ∗ | 0.2346 ∗ | 0.2859 ∗ | 0.1912 ∗ | - | - | - | - |
| EdgeR | 0.4046 ∗ | 0.2782 ∗ | 0.3653 ∗ | 0.2395 ∗ | - | - | - | - |
| LDM | 0.3283 | 0.1850 | 0.3094 | 0.1677 | 0.3283 | 0.1850 | 0.4150 | 0.2700 |
| Limma | 0.3981 | 0.2636 ∗ | 0.3701 ∗ | 0.2369 ∗ | - | - | - | - |
| Linear regression | 0.3358 | 0.1867 | 0.3030 | 0.1573 | 0.3214 | 0.1735 | 0.4080 | 0.2735 |
| MetagenomeSeq | - | - | - | - | - | - | 0.4900 ∗ | 0.3731 ∗ |
| Monocle | 0.8637 ∗ | 0.8275 ∗ | 0.8055 ∗ | 0.7579 ∗ | 0.3251 | 0.1766 | 0.4107 | 0.2761 |
| QRank | 0.3887 | 0.2346 | 0.2981 | 0.1641 | 0.3634 | 0.2160 | 0.3593 | 0.2117 |
| ZINQ-MinP | 0.4437 | 0.2733 | 0.3452 | 0.1945 | 0.4188 | 0.2535 | 0.4176 | 0.2519 |
| ZINQ-Cauchy | 0.4919 | 0.3156 | 0.3941 | 0.2333 | 0.4666 | 0.2943 | 0.4627 | 0.2919 |
∗: power of a method that inflates type I error
+: power of a method that deflates type I error
Results by the various methods on un-normalized/normalized simulated OTU table generated from the proposed two-part quantile model fitted on CARDIA data, including the average FPR and average TPR over 1000 runs according to significance cutoffs 0.05 and 0.01
Average FPR and TPR by unadjusted analysis on un-normalized/normalized simulated DM OTU tables
| Sample size = 531 | ||||||||
|---|---|---|---|---|---|---|---|---|
| 0.05 | 0.01 | 0.05 | 0.01 | 0.05 | 0.01 | 0.05 | 0.01 | |
| FPR | ||||||||
| Corncob | 0.0522 | 0.0115 | 0.0523 | 0.0115 | - | - | - | - |
| DESeq2 | 0.0954 | 0.0305 | 0.0951 | 0.0304 | - | - | - | - |
| EdgeR | 0.0588 | 0.0130 | 0.0580 | 0.0133 | - | - | - | - |
| LDM | 0.0494 | 0.0097 | 0.0493 | 0.0098 | 0.0494 | 0.0097 | 0.0499 | 0.0097 |
| Limma | 0.0493 | 0.0098 | 0.0493 | 0.0101 | - | - | - | - |
| Linear regression | 0.0475 | 0.0085 | 0.0475 | 0.0085 | 0.0475 | 0.0085 | 0.0496 | 0.0101 |
| MetagenomeSeq | - | - | - | - | - | - | 0.1354 | 0.0552 |
| Monocle | 0.9463 | 0.9296 | 0.9452 | 0.9283 | 0.0481 | 0.0087 | 0.0501 | 0.0102 |
| QRank | 0.0489 | 0.0098 | 0.0496 | 0.0096 | 0.0489 | 0.0097 | 0.0491 | 0.0096 |
| ZINQ-MinP | 0.0468 | 0.0092 | 0.0468 | 0.0091 | 0.0466 | 0.0090 | 0.0478 | 0.0092 |
| ZINQ-Cauchy | 0.0522 | 0.0108 | 0.0523 | 0.0106 | 0.0524 | 0.0107 | 0.0523 | 0.0108 |
| TPR | ||||||||
| Corncob | 0.3009 | 0.1678 | 0.3000 | 0.1675 | - | - | - | - |
| DESeq2 | 0.2210 ∗ | 0.1095 ∗ | 0.2207 ∗ | 0.1093 ∗ | - | - | - | - |
| EdgeR | 0.1603 | 0.0642 ∗ | 0.1610 | 0.0652 ∗ | - | - | - | - |
| LDM | 0.1554 | 0.0646 | 0.1553 | 0.0646 | 0.1554 | 0.0646 | 0.2775 | 0.1530 |
| Limma | 0.2923 | 0.1647 | 0.2918 | 0.1646 | - | - | - | - |
| Linear regression | 0.1529 | 0.0611 | 0.1528 | 0.0611 | 0.1528 | 0.0610 | 0.2888 | 0.1619 |
| MetagenomeSeq | - | - | - | - | - | - | 0.3165 ∗ | 0.1955 ∗ |
| Monocle | 0.9610 ∗ | 0.9485 ∗ | 0.9603 ∗ | 0.9476 ∗ | 0.1537 | 0.0616 | 0.2901 | 0.1630 |
| QRank | 0.2318 | 0.1156 | 0.2316 | 0.1152 | 0.2325 | 0.1162 | 0.2253 | 0.1088 |
| ZINQ-MinP | 0.2419 | 0.1244 | 0.2414 | 0.1236 | 0.2422 | 0.1242 | 0.2391 | 0.1194 |
| ZINQ-Cauchy | 0.2820 | 0.1511 | 0.2814 | 0.1506 | 0.2819 | 0.1514 | 0.2785 | 0.1449 |
∗: power of a method that inflates type I error
Results by the various methods on un-normalized/normalized simulated OTU table generated from the DM models fitted on CARDIA data, including the average FPR and average TPR over 1000 runs according to significance cutoffs 0.05 and 0.01
Fig. 2Type I error control of the various methods on permuted normalized CARDIA data. a Boxplots of type I error over 50 null rarefied data. b Boxplots of type I error over 50 null CSS normalized data
Numbers of differentially abundant taxa by valid methods on data normalized by rarefaction or CSS
| Rarefaction | CSS | |
|---|---|---|
| Corncob | 16 ∗ | – |
| DESeq2 | 34 ∗ | – |
| EdgeR | 33 ∗ | – |
| LDM | 11 | 23 |
| Limma | 24 | – |
| Linear regression | 5 | 25 |
| MetagenomeSeq | – | 40 ∗ |
| Monocle | 121 ∗ | 20 |
| QRank | 13 | 12 |
| ZINQ-MinP | 48 | 37 |
| ZINQ-Cauchy |
∗: method that inflates type I error
Fig. 3Venn diagrams of the differentially abundant taxa detected by ZINQ, QRank, and valid parametric methods that control type I error. a Results on rarefied data and b results on CSS normalized data
Fig. 4Empirical quantile functions (quantiles of normalized abundance (quantile) vs. quantile levels (τ)) stratified by with / without HBP for two typical taxa detected by ZINQ exclusively, with dashed horizontal lines indicating the two group means, which are close or identical in the examples. a Spindle shape, HBP is associated with lower Eubacterium abundance (rarefied) when the microbe is abundant. b Crossing, HBP is associated with lower Haemophilus abundance (CSS normalized) when the microbe is at a low level, but with higher abundance when the microbe is abundant
Fig. 5Heterogeneity comparison between the taxa detected by ZINQ exclusively and those found by the valid parametric methods that control type I error but not ZINQ on the CSS normalized data