Literature DB >> 34793531

Analysing microbiome intervention design studies: Comparison of alternative multivariate statistical methods.

Maryia Khomich1,2, Ingrid Måge3, Ida Rud1, Ingunn Berget3.   

Abstract

The diet plays a major role in shaping gut microbiome composition and function in both humans and animals, and dietary intervention trials are often used to investigate and understand these effects. A plethora of statistical methods for analysing the differential abundance of microbial taxa exists, and new methods are constantly being developed, but there is a lack of benchmarking studies and clear consensus on the best multivariate statistical practices. This makes it hard for a biologist to decide which method to use. We compared the outcomes of generic multivariate ANOVA (ASCA and FFMANOVA) against statistical methods commonly used for community analyses (PERMANOVA and SIMPER) and methods designed for analysis of count data from high-throughput sequencing experiments (ALDEx2, ANCOM and DESeq2). The comparison is based on both simulated data and five published dietary intervention trials representing different subjects and study designs. We found that the methods testing differences at the community level were in agreement regarding both effect size and statistical significance. However, the methods that provided ranking and identification of differentially abundant operational taxonomic units (OTUs) gave incongruent results, implying that the choice of method is likely to influence the biological interpretations. The generic multivariate ANOVA tools have the flexibility needed for analysing multifactorial experiments and provide outputs at both the community and OTU levels; good performance in the simulation studies suggests that these statistical tools are also suitable for microbiome data sets.

Year:  2021        PMID: 34793531      PMCID: PMC8601541          DOI: 10.1371/journal.pone.0259973

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The microbiome has emerged as an important link to health and disease [1]. Microbiome analysis methods are advancing rapidly, in particular in areas such as compositional data analysis, multi-omics and data integration [2, 3]. A clear understanding of the type of data being analysed is crucial, given the growing number of studies uncovering the key role of the microbiome, its composition and its functions following dietary intervention or medical treatment [4]. At present, analysis of complex microbial data benefits from adapting the multivariate statistical toolbox of ecology and environmental sciences, and a proper choice of statistical tools is becoming increasingly important [5-7]. However, the lack of benchmarking studies and of a clear consensus on the best multivariate statistical practices makes comparisons across microbiome data sets difficult [2, 8]. New methods are often tested by simulation studies, but there is always a concern that simulations can be biased towards the tested statistical model and cannot mimic the complexity of real microbiome data [6, 9]. Moreover, newly introduced tools are often optimised, whereas the comparison of several statistical methods implies the use of standard or default parameters [6, 9]. It is therefore of interest to compare existing methods on real data sets of different complexity, in addition to simulation studies, to better understand how the choice of method affects the results.

Different statistical methods have different properties, and the choice of method should depend on the scientific question, the experimental design, the data characteristics and the expected relationships among the variables. Furthermore, the choice of method is often biased by a research group’s tradition and familiarity with specific “toolboxes”. The main differences between existing statistical approaches for analysing microbiome data relate to whether they are: (1) explorative versus confirmative; (2) univariate versus multivariate; (3) parametric versus nonparametric; (4) linear versus nonlinear; (5) compositional versus non-compositional; (6) distance-based versus count/abundance-based; and (7) incorporating phylogenetic information into the analysis or not [10-12].

Here, we explore statistical methods for analysing microbiome data from designed experiments, with a focus on dietary intervention trials. In contrast to observational studies, these are usually small in sample size but performed in (semi-)controlled environments and tailored to a specific research hypothesis. The studies often include multiple experimental factors, possibly with more than two levels, and it is therefore natural to turn to analysis of variance (ANOVA)-like methods. Notably, most published analytical tools in microbiome research are essentially univariate [6], which led us to conclude that a comparison of alternative multivariate statistical tools is sorely missing. From a biologist’s point of view, it is also important that the methods are easy to interpret, both at the multivariate (microbial community) and univariate (microbial taxa or operational taxonomic units, OTUs) levels (see Fig 1).
Fig 1

A diagram of statistical methods used in the study.

Distance-based methods

The distance-based methods are multivariate since multiple variables (microbial OTUs) are used to calculate pairwise distances between samples. Among distance-based methods, permutational multivariate analysis of variance (PERMANOVA) is the most widely used and is more powerful than analysis of similarities (ANOSIM) at detecting changes in community structure [13-15]. Both methods may be used with any dissimilarity metric. Among abundance-based beta diversity indices, Bray-Curtis is the most common choice for count data [16, 17]. The most widely applied phylogenetic beta diversity indices are the UniFrac-type metrics [17-19]. However, UniFrac is unsuitable as a distance metric for studies with a small sample size, which is usually the case for dietary intervention trials [20, 21]. Both PERMANOVA and ANOSIM test differences at the community level but do not provide any information at the OTU level. Similarity percentage analysis (SIMPER) works at the univariate level by computing the relative contribution of each analysed microbial taxon (i.e. OTU) to the overall average Bray-Curtis dissimilarity in pairwise comparisons of two or more groups [15]. To the best of our knowledge, no such method exists for the other distance metrics. Distance-based methods have strengths and weaknesses that are important to consider beforehand. ANOSIM cannot deal with multifactorial designs, and both ANOSIM and PERMANOVA may have problems detecting differences unless these are present in taxa with high variability [22]. Newer methods aimed at assigning more interpretable effect sizes are under constant development [6, 23]. For example, the more flexible PERMANOVA-S is an extension of existing distance-based methods that can adjust for covariates and simultaneously incorporate multiple distance metrics [24]. However, these methods do not consider covariance/correlation between microbial species, and they suffer significant power loss if all microbial species are used for the distance calculations [25].
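As an illustration, a minimal R sketch of the distance-based workflow described above, using the vegan package. The objects `otu` (a samples-by-OTUs relative abundance matrix) and `meta` (a data frame with a `diet` factor) are hypothetical placeholders, not data from the paper; the specific argument choices are assumptions for illustration only.

```r
library(vegan)

## Bray-Curtis dissimilarities between samples
bc <- vegdist(otu, method = "bray")

## PERMANOVA and ANOSIM at the community level (ANOSIM handles one factor only)
perm <- adonis(bc ~ diet, data = meta, permutations = 999)
anos <- anosim(bc, grouping = meta$diet, permutations = 999)

## SIMPER: per-OTU contributions to the average Bray-Curtis dissimilarity
## between diet groups, with permutation-based p-values
simp <- summary(simper(otu, group = meta$diet, permutations = 999))
```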

Abundance-based methods

The abundance-based methods can be either univariate (analysing each OTU individually) or multivariate (focusing on the covariance structure between OTUs). There are two main approaches to dealing with the special nature of abundance data: (1) application of methods that model the distribution of count data, or (2) compositional data analysis (CoDa) based on log-ratio transformed count data [26, 27].

Statistical methods designed for high-throughput sequencing data include ANOVA-like differential expression analysis (ALDEx2) [26], analysis of composition of microbiomes (ANCOM) [28], edgeR [29] and DESeq/DESeq2 [30]. edgeR and DESeq2 model count data directly using negative binomial generalized linear models, whereas ALDEx2 and ANCOM apply a log-ratio transformation prior to univariate assessment of statistical significance for individual OTUs. DESeq2 and edgeR are based on the same modelling approach but differ in normalisation, outlier handling and other adjustable parameters; the two methods had similar performance in simulation studies [30]. We therefore included only one of them, DESeq2, because the differences between DESeq2 and edgeR are at a different conceptual level than the differences between these and the other methods discussed. ALDEx2 uses a Dirichlet-multinomial probability distribution to estimate abundances from count data and calculates the false discovery rate (FDR) based on Monte Carlo sampling (see Fig 3 in [26] for details). In ANCOM, the compositional nature of the data is taken into account by testing the log-ratio for all pairs of OTUs and then counting the number of tests in which the log-ratio is significantly different from zero. This number (the W-stat) can be used to rank the OTUs most likely to differ between the groups. The recently published ANCOM-BC corrects the bias induced by differences in sampling fractions and provides p-values and confidence intervals for the differential abundance of each OTU [31]. The FDR and power were shown to be similar for ANCOM and ANCOM-BC, and we therefore limit the present study to ANCOM.

The drawback of univariate methods is that they treat all taxa as independent variables without considering the covariance between the OTUs. Such methods may fail to detect community-level differences [32]. The classical generalisation of ANOVA to multiple variables (MANOVA) cannot be used when the number of variables exceeds the number of samples, as it then suffers from singular covariance matrices and unfulfilled assumptions [33, 34]. Novel ANOVA-like methods include fifty-fifty multivariate analysis of variance (FFMANOVA) [35] and ANOVA-simultaneous component analysis (ASCA) [33]. Both methods are based on principal component analysis (PCA), and they can handle multiple collinear responses. In FFMANOVA, the multivariate effects are estimated by a modified variant of classical MANOVA, and OTU-level p-values are obtained by rotation tests, which adjust the p-values for multiple testing [36]. In ASCA, the multivariate effects are calculated from the combined sums of squares of all OTUs, and significance is assessed by permutation testing. ASCA also provides scores and loadings related to each experimental factor, which can be visualised in the same way as for PCA to better understand covariance patterns within the data. The contribution of each OTU can be quantified by the loadings or by partial least squares discriminant analysis (PLS-DA) for pairwise comparisons.
ASCA has recently gained popularity in metabolomics [37-39], and both ASCA and FFMANOVA have successfully been applied to microbiome data [40-44]. Linear discriminant analysis effect size (LEfSe) is a stepwise approach that combines univariate analysis with multivariate discriminant analysis [45]. LEfSe has found wide application in microbiome research owing to its easy-to-use and easy-to-interpret visualisation [46, 47], but it is not adapted to experimental designs with several multilevel factors and is therefore not considered in this study.
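As a small sketch of the log-ratio (CoDa) approach mentioned above, the centred log-ratio (clr) transformation can be written in a few lines of R. The object `counts` is a hypothetical samples-by-OTUs count matrix; the pseudo-count used here is only for illustration, whereas the paper itself uses the zCompositions package for zero replacement (see Methods).

```r
## clr(x) = log(x) - mean(log(x)), applied per sample (row)
clr_transform <- function(x) {
  logx <- log(x)
  sweep(logx, 1, rowMeans(logx), "-")   # subtract the per-sample log geometric mean
}

counts_nozero <- counts + 0.5                       # naive pseudo-count for illustration only
otu_clr <- clr_transform(counts_nozero / rowSums(counts_nozero))
```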

Method comparison

An overview of the different methods compared in this study is given in Fig 1 and Table 1. ANOSIM and PERMANOVA provide results only at the community level, while SIMPER, DESeq2, ANCOM and ALDEx2 report results for single OTUs. ASCA and FFMANOVA are generic methods and the only methods that provide results at both the community and OTU level.
Table 1

An overview of statistical methods and their properties.

| Method | Method name | Number of experimental factors allowed | Parametric | Multivariate | Univariate | Provides output at community level | Statistics for ranking OTUs | Reference |
|---|---|---|---|---|---|---|---|---|
| ALDEx2 | ANOVA-like differential expression tool for high-throughput sequencing data | any | yes | no | yes | no | p-values or effect sizes | [26] |
| ANCOM | Analysis of composition of microbiomes | main factor + covariates | yes | no | yes | no | W-stat for the main variable | [28] |
| ANOSIM | Analysis of similarities | one | no | yes | no | yes | no | [15] |
| ASCA | ANOVA-simultaneous component analysis | any | yes | yes | no | yes | loadings or PLS-DA regression coefficients | [33] |
| DESeq2 | Differential gene expression analysis based on the negative binomial distribution | any | yes (GLM) | no | yes | no | p-values or effect sizes (coefficients) | [30] |
| FFMANOVA | Fifty-fifty multivariate ANOVA | any | yes | yes | yes (rotation tests) | yes | p-values | [35] |
| PERMANOVA | Permutational multivariate analysis of variance | any | no | yes | no | yes | no | [14] |
| SIMPER | Similarity percentage | two-group comparison | no | yes | no | no | permutation p-values | [15] |

ANOVA—analysis of variance; GLM—generalized linear model; OTU—operational taxonomic unit; PLS-DA—partial least squares discriminant analysis.

The aim of the method comparison was to investigate how different strategies for statistical modelling affect biological inference. At the community level, methods were compared with respect to effect sizes (expressed as percentage of explained variance) and the corresponding p-values. At the OTU level, comparison of methods is more complex because some methods provide results for an omnibus test of differences between factor levels (FFMANOVA and ANCOM), whereas the other methods provide rankings for specific pairwise comparisons (ASCA with PLS-DA, SIMPER) or for contrasts/model coefficients (ALDEx2 and DESeq2). Even so, a biologist will make inferences based on the output provided by the chosen method, and in this context it is relevant to compare the ranking statistics even though the underlying tests are not the same. In our study, the ranking of OTUs was compared by Spearman’s rank correlation and by inspection of scatterplots between the different ranking metrics (see the sketch below). For the simulated data, where we know which OTUs are differentially abundant, the True Positive Rate (TPR) and True Negative Rate (TNR) were also evaluated.

We focused exclusively on designed experiments, which are usually smaller in sample size and more controlled than observational studies. We used five published data sets as a basis for the comparisons (S1 Table). Studies were included if they met the following criteria: (1) an at least two-factorial experimental design with a minimum of two factor levels; (2) either a human or an animal gut microbiome survey; and (3) taxonomic assignment reported at the OTU level. Diet is the main factor of interest in all five studies, and we restricted our comparisons to this factor. The simulated data were based on data set 1 (S1 Table), using the same study design and OTU counts as a starting point. Four different scenarios were simulated to investigate how the methods perform in situations with varying effect sizes and different numbers of differentially abundant OTUs. In all scenarios, one of the diet levels was manipulated to be significantly different from the others, and there was no effect of the second experimental factor (dose). See the Methods section for further details.
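A minimal R sketch of how two ranking metrics can be compared, as done in this study. The vectors `rank_ffmanova` and `rank_ascaplsda` are hypothetical placeholders holding the ranking statistic of each OTU (e.g. p-values or regression coefficients) from two methods, in the same OTU order.

```r
## Spearman's rank correlation between the two ranking metrics
rho <- cor(rank_ffmanova, rank_ascaplsda, method = "spearman",
           use = "pairwise.complete.obs")

## Scatterplot of the two ranking metrics for visual inspection
plot(rank_ffmanova, rank_ascaplsda,
     xlab = "FFMANOVA ranking metric", ylab = "ASCA/PLS-DA ranking metric",
     main = paste0("Spearman rho = ", round(rho, 2)))
```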

Results

Community level

Four of the methods can be used to test the association between diet and the overall microbiome composition: the distance-based ANOSIM and PERMANOVA, and the abundance-based ASCA and FFMANOVA. The results for each of the four simulated scenarios are shown in Fig 2, and the results across the real data sets are summarised in Table 2.
Fig 2

Explained variance for simulated data and the relative number of simulations where the simulated effect was detected.

Numbers indicate the percentage of simulated data sets in which the p-value was significant.

Table 2

Community-level method comparison across five experimental data sets.

| Factor/predictor | FFMANOVA expl. var. (%) | FFMANOVA p-value [1] | ASCA expl. var. (%) | ASCA p-value [2] | PERMANOVA expl. var. (%) | PERMANOVA p-value [2] | ANOSIM [3] effect size (%) | ANOSIM p-value [2] |
|---|---|---|---|---|---|---|---|---|
| Moen et al., 2016 (data set 1); model: OTU ~ fiber*dose (ANOSIM: OTU ~ fiber_dose) | | | | | | | | |
| fiber | 34.31 | <0.001 | 37.39 | <0.001 | 37.26 | 0.001 | 82.19 | 0.001 |
| dose | 3.96 | <0.001 | 4.14 | <0.001 | 4.12 | 0.001 | | |
| fiber:dose | 5.85 | <0.001 | 5.99 | <0.001 | 5.96 | 0.002 | | |
| residuals | 55.75 | | 52.69 | | 52.67 | | 17.81 | |
| Lai et al., 2018 (data set 2); model: OTU ~ diet*exercise (ANOSIM: OTU ~ diet_exercise) | | | | | | | | |
| diet | 27.31 | <0.001 | 30.50 | <0.001 | 30.85 | 0.001 | 99.11 | 0.001 |
| exercise | 12.18 | <0.001 | 14.08 | <0.001 | 13.99 | 0.001 | | |
| diet:exercise | 7.93 | <0.001 | 7.51 | <0.001 | 7.47 | 0.002 | | |
| residuals | 52.31 | | 47.74 | | 47.70 | | 0.89 | |
| Le Sciellour et al., 2018 (data set 3); model: OTU ~ diet*period + subject (ANOSIM: OTU ~ diet_period) | | | | | | | | |
| diet | 1.78 | <0.001 | 1.99 | <0.001 | 2.14 | 0.001 | 13.33 | 0.001 |
| period | 3.29 | <0.001 | 3.45 | <0.001 | 3.51 | 0.001 | | |
| diet:period | 1.32 | <0.001 | 1.41 | 0.006 | 1.36 | 0.009 | | |
| subject | 26.53 | <0.001 | 26.42 | 0.073 | 27.28 | 0.046 | 86.67 | |
| residuals | 66.13 | | 65.71 | | 65.71 | | | |
| Wang et al., 2016 (data set 4); model: OTU ~ diet + time + diet:time + subject (ANOSIM: OTU ~ diet_time) | | | | | | | | |
| diet | 2.49 | 0.067 | 2.11 | 0.038 | 2.27 | 0.036 | 4.32 | 0.143 |
| time | 2.00 | 0.007 | 1.96 | 0.082 | 1.77 | 0.246 | | |
| diet:time | 5.15 | 0.824 | 4.41 | 0.777 | 4.45 | 0.701 | | |
| subject | 50.44 | <0.001 | 54.23 | <0.001 | 64.30 | 0.001 | 95.68 | |
| residuals | 31.23 | | 27.25 | | 27.21 | | | |
| Birkeland et al., 2020 (data set 5); model: OTU ~ treatment:day + subject (ANOSIM: OTU ~ treatment_day) | | | | | | | | |
| treatment:day | 1.75 | <0.001 | 1.27 | 0.107 | 1.27 | 0.132 | -0.03 | 0.998 |
| subject | 69.38 | <0.001 | 73.85 | 0 | 73.85 | 0.001 | | |
| residuals | 28.88 | | 24.87 | | 24.87 | | | |

Distance-based ANOSIM and PERMANOVA and abundance-based ASCA and FFMANOVA were compared with respect to effect sizes (expressed as percentage of explained variance) and corresponding p-values.

[1] Based on the 50–50 F-test, 999 permutations.

[2] Based on 999 permutations.

[3] Based on a combined factor with no interaction in the model (a limitation of ANOSIM).

clr: centred log-ratio transformed data were used as input for all methods.


Simulated data

As expected, the multivariate effect size (explained variance) was lowest (around 5%) for the “Few-Low” scenario and highest (30–35%) for the “Many-High” scenario. The multivariate effect was significant in 100% of the simulations for the three scenarios with the highest effect sizes. For the “Few-Low” scenario, FFMANOVA performed best, detecting the effect in 80% of the data sets. For the “Many” simulations, explained variance was slightly higher with FFMANOVA than with ASCA and PERMANOVA, whereas the opposite trend was observed for the “Few-High” simulations. ANOSIM was less consistent than the other methods, with higher within-scenario variation and larger differences in effect size between the two “Low” scenarios.

Experimental data

Data set 1. The effect of the dietary fibres inulin (IN), cellulose (CE) and brewers’ spent grain (BSG) on the overall caecal microbiota composition in mice from a study by Moen et al. [42] accounted for 34–37% of the explained variance according to FFMANOVA, ASCA and PERMANOVA. In general, the three methods produced similar results, with slightly smaller p-values from FFMANOVA and ASCA.

Data set 2. Lai et al. [48] investigated the effect of diet (the main variable), exercise and their interaction on the overall faecal microbiota in sedentary and exercised mice fed a high-fat or normal-fat diet (four groups in total). Again, all tested methods except ANOSIM produced congruent results (effect of diet 27–31%), with smaller p-values from FFMANOVA and ASCA.

Data set 3. In a longitudinal study by Le Sciellour et al. [49], the authors tested the effect of dietary fibre content on the faecal microbiota of growing-finishing pigs fed alternately a low-fibre and a high-fibre diet during four successive 3-week periods. In this survey, the effect of diet was small (2%) compared with data sets 1 and 2. FFMANOVA and ASCA again reported slightly lower p-values than PERMANOVA, but all three methods agreed on the effect of diet.

Data set 4. In a longitudinal study by Wang et al. [50], the objective was to determine the impact of beta-glucan on the composition of the faecal microbiota in mildly hypercholesterolemic individuals. For 5 weeks, the individuals received either 3 g high molecular weight (HMW), 3 g low molecular weight (LMW) or 5 g LMW barley beta-glucan, or wheat and rice (control group) [50]. The effect of diet accounted for ~2% of the explained variance reported by FFMANOVA, ASCA and PERMANOVA. Diet was significant at the 5% level for PERMANOVA and ASCA (p = 0.036 and p = 0.038, respectively) and at the 10% level for FFMANOVA (p = 0.067). However, different conclusions were drawn with respect to time, where a significant result at the community level was obtained only with FFMANOVA (p = 0.007).

Data set 5. Birkeland et al. [44] assessed the effect of prebiotic fibres or a control supplement on faecal microbiota composition in human subjects with type 2 diabetes. The interaction effect of treatment and day accounted for 1–2% of the explained variance according to FFMANOVA, ASCA and PERMANOVA. A significant result at the community level was obtained only with FFMANOVA (p < 0.001).

To summarise, PERMANOVA, FFMANOVA and ASCA gave almost identical results regarding effect sizes and statistical significance across studies (Table 2). They all revealed a considerable difference in the effect size of the main factor of interest (diet) between the animal (2–37%) and human (1–2%) dietary interventions, with effect sizes being very small in the human studies. In addition, three of the studies had crossover designs allowing for estimation of interindividual variation. This variation was considerably higher for trials involving human subjects (54–74%) than for the animal study (26%). ANOSIM produced the most divergent results and was not able to reveal the same biological insights, since the multifactorial nature of the studies cannot be taken into account by this approach.

OTU level

Six of the methods, namely SIMPER, ASCA, FFMANOVA, ANCOM, ALDEx2 and DESeq2, can be used to make biological inferences for individual OTUs. The methods give different outputs, which can be used to identify differentially abundant OTUs and/or rank the OTUs according to effect sizes (see Methods for details). The True Positive Rate (TPR) for the four simulated scenarios is shown in Fig 3. The True Negative Rate (TNR) was close to 100% for all methods and is therefore not shown. ASCA provided the overall best results in terms of the TPR in all four scenarios. FFMANOVA, SIMPER and ANCOM were all highly sensitive in the scenario with few significant OTUs and a high effect size (“Few-High”). FFMANOVA was also highly sensitive in the scenario with many significant OTUs and a high effect size (“Many-High”); SIMPER and ANCOM performed best in the scenario with few significant OTUs and a lower effect size (“Few-Low”). Both ALDEx2 and DESeq2 detected very few OTUs in any of the scenarios and therefore had very low TPR.
Fig 3

Sensitivity (True Positive Rate) for the four scenarios in the simulation study.

Experimental data

Summary tables and scatterplots comparing ranking for different methods on the experimental data sets are given as S1 File and S1 Fig, respectively. The number of significant OTUs detected by the methods is summarised in S2 Table. As expected, many significant OTUs were discovered in the studies with large multivariate effect sizes (data sets Moen and Lai), whereas few OTUs were found in the studies with low multivariate effect sizes. The highest number of significant OTUs was identified by FFMANOVA, with almost twice the numbers detected by ALDEx2. ANCOM differed considerably between the study designs. ASCA recovered fewer OTUs than the other methods for the Moen data set, but more OTUs than the other methods for the other data sets.

Correlation between the methods

Agreement between the methods was investigated by calculating Spearman’s rank correlation between all pairs of output metrics (Fig 4). In the simulated data, FFMANOVA and ASCA had higher agreement than any other pair of methods, with correlations ranging from 0.6 to 0.75. In addition, the correlations were generally higher for scenarios with many differentially abundant OTUs for all methods. The results from the experimental data sets also showed that FFMANOVA and ASCA had highest agreement, with correlations ranging from 0.4 to 0.9. However, correlations varied considerably between the data sets. Highest agreements were observed for the animal studies, which are more controlled, and where interindividual variation is smaller compared to studies involving human subjects. The Moen and Wang data sets had a lower correlation between the methods than the other data sets for many comparisons, which could be explained by the fact that there are more than two levels of the diet factor, and omnibus tests will therefore not completely agree with the pairwise comparisons.
Fig 4

Spearman’s correlation (Y-axis) calculated for pairwise comparison of statistical methods (X-axis) for (A) simulated data and (B) five experimental data sets.

Each point represents Spearman’s rank correlation coefficient between OTU ranking metrics from the two methods compared.


Impact of OTU abundance

Gut microbiome data typically consist of a few dominant OTUs (relative abundance >1%) and a majority of low-abundant OTUs. One is therefore interested in ranking OTUs independently of their average abundance, so that biologically relevant changes in low-abundant OTUs can also be detected. For all methods except SIMPER, correlations in the range 0.1–0.4 were observed between the ranking statistics and the (log) mean relative abundance. However, it was not consistent across the data sets which method gave the higher correlations. Correlations between the relative abundance and the ranking statistics were highest for the Moen data set and lowest for the Birkeland data set (S2 Table). ANCOM differed from the other methods in that the ranking of OTUs was highly dependent on the abundance (Fig 5 and S1 Fig). In particular, the results indicated that highly abundant OTUs had either very high or very low W-stat, whereas low-abundant OTUs always had medium-to-low W-stat. To the best of our knowledge, this finding has not been reported before, and it shows that ANCOM is not able to identify changes in low-abundant OTUs. In the simulation study, clear differences between the scenarios with “Few” or “Many” differentially abundant OTUs can be observed (Fig 5A and 5C), and similar patterns are reflected in the experimental data (Fig 5B and 5D). It has been shown that the power of ANCOM drops when the number of differentially abundant OTUs exceeds 25% [28]. In our “Many” simulations, 50% of the OTUs are differentially abundant, which explains why ANCOM performed poorly in these scenarios. The dependency on relative abundance may also be more problematic for the “Many” scenarios (Fig 5C) and for data sets with many differentially abundant OTUs, such as the Moen data set (Fig 5D).
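A minimal R sketch of the diagnostic behind Fig 5: plotting the ANCOM W-stat against the (log) mean relative abundance of each OTU. The objects `relabund` (a samples-by-OTUs relative abundance matrix) and `w_stat` (the ANCOM W statistic per OTU, in the same OTU order) are hypothetical placeholders.

```r
## Mean relative abundance per OTU
mean_abund <- colMeans(relabund)

## W-stat versus log10 mean relative abundance
plot(log10(mean_abund), w_stat,
     xlab = "log10 mean relative abundance", ylab = "ANCOM W-stat",
     main = "Dependence of the W-stat on OTU abundance")
```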
Fig 5

Mean relative abundance (log-scale) plotted versus ANCOM W-stat.

(A) The “Few-Low” simulation scenario and (B) Birkeland data set, (C) “Many-Low” simulation scenario and (D) Moen data set. Red points (panels A and C) indicate differentially abundant OTUs.


Discussion

Diet is considered to be an important driver of microbiome variation [51]. However, in observational population-based studies, diet consistently accounts for only a small proportion of microbiome variation, partly because of large interindividual differences in microbiome composition, small sample sizes and limitations in study designs, such as potentially insufficient washout periods in crossover studies [51-55]. In general, higher interindividual variation is observed in the gut microbiome of human subjects than in animal species [56]. This was confirmed by our results, and it can partly explain the smaller diet-related effect on the gut microbiomes in the two human studies. How much a diet can influence the microbiome also depends on the nutritional differences between the compared diets. Nevertheless, the use of animal models to study a causal role of the gut microbiome in health and disease is an established practice, although animal models lack the specific interactions present in the complex system of a human organism [57].

Univariate and multivariate analyses provide information at different levels. Biologists will often find the outputs of univariate analyses easier to interpret than those generated by multivariate analyses, though the assumptions are similar for both types of method [10]. In general, multivariate methods provide a more holistic overview of differences between samples and account for correlations and interactions between the variables, whereas univariate methods are well suited to pointing out differences for specific microbial groups. The two levels of analysis therefore provide complementary information, and it is generally of biological interest to report differences at both levels.

FFMANOVA and ASCA consider the covariance between all OTUs. It could therefore be expected that these methods would be better at detecting scenarios with many differentially abundant OTUs, owing to the “consistency at large” phenomenon, i.e. that many OTUs carry the same information and such methods collectively are able to detect small effects that are not significant at the univariate level [58, 59]. At the community level, these methods performed similarly to the distance-based methods. At the OTU level, they had considerably higher sensitivity for identifying true positive OTUs than the other methods in the “Many” OTU scenarios.

FFMANOVA and ASCA depend on the relative scaling of the OTUs, whereas the distance-based methods depend on the chosen distance measure. It is common practice in many areas to scale all variables to equal variance, thus giving them equal weight in the model, but other options are also possible; see, for instance, van den Berg et al. [60]. The clr-transformation puts the variables on comparable levels, and the need for scaling is then less obvious. However, the highly abundant OTUs might still have slightly higher variance, and scaling should be considered depending on the data characteristics and the intended biological interpretation. With scaling, all OTUs contribute equally to the analysis, whereas without scaling the more abundant OTUs will dominate, and the inferences will relate mainly to the more abundant OTUs.

It is well known that microbial sequence data are zero-inflated, and rare OTUs should be removed prior to downstream statistical analyses. We observed that the threshold for filtering out OTUs can substantially affect the results at both the community and OTU levels.
This is exemplified by the human Birkeland data set, where the stricter OTU filtering performed in the present study resulted in no significant treatment effect by ASCA, in contrast to the original publication [44].

The tools tested in the present study vary in flexibility. SIMPER allows only pairwise comparisons, and ANOSIM provides multigroup analysis but is not suited to multifactorial study designs. The other methods can fit more complex models with multiple factors, varying numbers of levels and corresponding interactions. In experiments with repeated measurements, the subject effect is often included as a random factor. None of the methods discussed here can do this, hence the subject effect was included as a fixed factor. These aspects should be considered when selecting methods, because different study designs might require different types of statistical models and tests. Newer developments of ASCA, namely ASCA+ [61] and LiMM-PCA [62], increase flexibility for unbalanced designs and random (i.e. subject) effects, respectively. Even so, longitudinal modelling of large microbiome data sets in combination with multiple covariates is only starting to emerge [6].

Although Spearman’s rank correlation indicated good agreement for the animal studies (Fig 4), little overlap between lists of “significant” results could be detected (results not shown). There can be several reasons for this. One important aspect is that different criteria must be used to define “significance” or, more generally, “importance”. FFMANOVA, SIMPER, DESeq2 and ALDEx2 provide p-values, whereas more heuristic criteria must be applied with ASCA and ANCOM. For some methods, such as FFMANOVA and ANCOM, the ranking relates to all levels of the experimental factors, whereas the other methods use only pairwise comparisons (SIMPER) or comparisons against a selected control/reference level (ALDEx2 and DESeq2). ASCA provides a test related to all levels at the community level, while the PLS-DA step relates to pairwise comparisons at the OTU level. This introduces a bias for data sets with multilevel factors where more than one level differs from the others. Nevertheless, we chose to compare these rankings, as this is the output that is available to the user. In the simulations, only one level of one factor was designed to have an effect, hence the “omnibus” and the “specific” tests are more directly comparable. In experiments with multilevel factors, additional information can be obtained by examining the clr-difference between the groups of interest in addition to the ranking statistics. Moreover, differences in sample collection, sample preparation and sequencing contribute additional variability, which in turn affects the validity of the results [63] and complicates comparisons across studies with similar interventions.

Past benchmarking studies [9, 64, 65] have reported varying results from different tools, which was also confirmed in our study. Currently, there is no consensus on the best tool for detecting differentially abundant microbial taxa, and there is no reason to believe that a single method is best in all cases. Based on our simulations, the generic multivariate tools ASCA and FFMANOVA performed best. We anticipate that these results will inform future studies with more complex settings, where more than one factor has an effect or interactions between experimental factors are included in the model.
In addition to performance, ease of use is an important aspect when selecting an appropriate tool. FFMANOVA and ASCA are based on standard statistical tools, namely PCA and ANOVA. Some of the tools designed for microbiome high-throughput sequencing data, on the other hand, can be difficult for non-statisticians to use, and it can be questioned whether users are able to interpret all parameters correctly, even when the methods are supported by comprehensive documentation. It is always good scientific practice to compare and report outputs from several methods. There are no standards for reporting multiple modelling results, and there is a high risk of “fishing for significance” when several methods are applied [66]. Before designing the experiment, researchers should be aware of the different properties of the statistical methods and consider whether it is more important not to miss any possible findings or to obtain robust results. In the latter case, OTUs should be reported as differentially abundant only if they were flagged as “significant” by several methods [66].

Conclusion

In the present study, we compared the performance of several multivariate ANOVA-like statistical methods, taking four simulated scenarios and five real dietary intervention microbiome data sets as examples. At the community level, the different methods came to similar conclusions; at the OTU level, the agreement between the methods varied considerably. ANCOM provided output metrics that were dependent on the average abundance, making it impossible to detect differences in low-abundant OTUs. At the OTU level, the rankings of OTUs obtained with the different methods correlated better for animal studies than for human studies, possibly due to lower interindividual variation in the animal studies. Based on the simulation results, we advise applying FFMANOVA and ASCA for overall and pairwise comparisons of microbiome data, respectively, also because these methods provide output at both the community and OTU levels, can handle several design factors, and can accommodate other data types common in microbiome research.

Methods

Experimental design and data characteristics

The dietary intervention data sets were selected based on the study design, with a minimum of two independent variables (for example, diet and dose; see S1 Table for details). Prior to the statistical analyses, the data were filtered to keep the OTUs that were present (1) with a relative abundance of more than 0.005% in an individual, and (2) in at least 50% of the individuals within one of the groups. In total, 507, 561, 560, 397 and 216 OTUs passed this filter for the respective studies and were subsequently used in downstream analyses (S1 Table and S2 File). All statistical analyses were performed in R version 4.1.0 [67] unless otherwise specified, and all permutation-based tests were run with 999 permutations.

Simulated data were generated using the R package metaSPARSim [68], developed for simulating 16S rDNA data. The simulation was a two-step process: (1) modelling the abundance (expected counts for each experimental group) using a Gamma distribution; (2) modelling the within-group variability using a Multivariate Hypergeometric (MHG) distribution, taking the output from step 1 as input parameters. In addition, metaSPARSim contains functions for parameter estimation from observed data. We used data set 1 [42] (S1 Table) to estimate realistic starting parameters for the simulations. Raw counts were generated for the same number of OTUs as in the experimental data set, and were subsequently pre-processed and analysed in the same manner as the experimental data (described below). Four scenarios were simulated, varying the number of differentially abundant OTUs (“Few” versus “Many”) and the effect sizes (“Low” versus “High”). 100 data sets were generated and analysed for each of the four scenarios:

Few-Low: 10 randomly selected OTUs were assigned random log2 fold changes from a uniform distribution with boundaries [3, 4].
Few-High: 10 randomly selected OTUs were assigned random log2 fold changes from a uniform distribution with boundaries [8, 9].
Many-Low: 254 randomly selected OTUs were assigned random log2 fold changes from a uniform distribution with boundaries [3, 4].
Many-High: 254 randomly selected OTUs were assigned random log2 fold changes from a uniform distribution with boundaries [8, 9].

Zero-value replacements were done prior to clr-transformation [27, 69] by applying the function cmultRepl with the setting method = "CZM" in the R package zCompositions [70]. Zero replacement is an ongoing and yet unsolved statistical problem in microbiome research, and newer methods are constantly being developed and applied to both simulated and experimental data sets [71-74].

ANOSIM and PERMANOVA were run using the functions anosim and adonis in the R package vegan [75], with the setting method = "euclidean" and the clr-transformed data as input. SIMPER was run on filtered relative abundance data using the function simper from the R package vegan [75]. The p-values for each OTU between selected pairwise comparisons of diet levels were used as a ranking metric; the p-value represents the probability of obtaining a larger contribution to the Bray-Curtis dissimilarity in a random permutation of the group factor. FFMANOVA was performed using the function ffmanova implemented in the R package ffmanova [35] on the clr-transformed and standardised data. Raw p-values were used as a ranking metric. ASCA was run on the clr-transformed and centred data using an in-house implementation in MATLAB (R2018b, The MathWorks Inc.); p-values at the community level were calculated by permutation tests (n = 999).
PLS-DA models [76], with the ASCA diet effect matrix plus residuals as predictors and the factor levels as the response, were used for pairwise comparison of diet levels. Variable Importance in Prediction (VIP) values [76] were used to identify significant OTUs, and the VIP threshold was set using the Uninformative Variable Elimination (UVE) method [77]. The UVE procedure was repeated 100 times, and OTUs that were above the threshold in >95% of the repetitions were defined as significant. For study designs with two diet levels, the ASCA loadings were used to rank OTUs, while PLS-DA regression coefficients were used for pairwise comparisons of multilevel diet factors. ANCOM was run using the ANCOM 2.0 source code implemented in R [28], with filtered relative abundance data as input. The W-stat, indicating the number of significantly different pairwise log-ratios after FDR adjustment by a Benjamini-Hochberg correction at the 0.05 level of significance, was used to rank OTUs. ALDEx2 was run using the functions aldex.clr and aldex.glm from the R package ALDEx2 v.1.18.0 [26]. We used raw counts as input and p-values for selected factor level contrasts to rank OTUs. DESeq2 was run using the functions DESeqDataSetFromMatrix (to construct the object), DESeq (for the analysis) and results (to extract the results) from the Bioconductor package DESeq2 version 1.32.0 [30]. We used raw counts as input and raw p-values for selected level contrasts to rank OTUs (default settings).
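As a rough illustration of the pre-processing and community-level steps described above, a minimal R sketch follows. The objects `counts` (samples-by-OTUs raw counts) and `meta` (a data frame with factors fiber and dose, echoing the design of data set 1) are hypothetical placeholders; the package functions (cmultRepl, adonis, ffmanova) are those named in the Methods, but the exact argument choices beyond those stated there are assumptions for illustration.

```r
library(zCompositions)
library(vegan)
library(ffmanova)

## 1. Replace zeros (count zero multiplicative method) and clr-transform
counts_nz <- as.matrix(cmultRepl(counts, method = "CZM"))
otu_clr   <- sweep(log(counts_nz), 1, rowMeans(log(counts_nz)), "-")

## 2. PERMANOVA on Euclidean distances of the clr-transformed data
perm <- adonis(otu_clr ~ fiber * dose, data = meta,
               method = "euclidean", permutations = 999)

## 3. FFMANOVA on the clr-transformed (standardised) data; rotation tests
##    (nSim) give OTU-level p-values adjusted for multiple testing
ffm <- ffmanova(otu_clr ~ fiber * dose, data = meta, stand = TRUE, nSim = 999)
```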

Factor level comparisons

For study designs with more than two diet groups, only one pairwise comparison was analysed for the methods based on the pairwise group comparison, namely ALDEx2, ASCA and SIMPER. The following pairs with the most contrasting outcomes were compared: (1) BSG group vs. IN group [42]; (2) control group vs. 3 g HMW [50] and (3) Placebo 6 weeks group vs. Fibre 6 weeks group [44]. Differentially abundant OTUs were identified by setting thresholds on the ranking metrics for each method. For the methods providing p-values, the threshold was set to 0.01. For ANCOM, the 60th percentile of the empirical distribution of the W-stat was used as a threshold. For ASCA, OTUs selected in 95 out of 100 UVE-runs were identified as differentially abundant. True Positive Rate (TPR) and True Negative Rate (TNR) were calculated for the simulated data. TPR (also called Power or Sensitivity) is calculated as TP/P, where TP is the number of true differentially abundant OTUs identified by a statistical method and P is the number of differentially abundant OTUs defined in the simulation setup. The TNR (also called Specificity) is calculated as TN/N, where TN is the number of true non-differentially abundant OTUs identified by a statistical method and N is the corresponding number defined in the simulation setup.
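A minimal R sketch of the TPR/TNR calculation defined above. The logical vectors `truth` (OTUs defined as differentially abundant in the simulation setup) and `called` (OTUs flagged by a given method) are hypothetical placeholders covering all OTUs in the same order.

```r
TP  <- sum(truth & called)     # true positives
TN  <- sum(!truth & !called)   # true negatives
TPR <- TP / sum(truth)         # sensitivity (power)
TNR <- TN / sum(!truth)        # specificity
```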

Supporting information

Pairwise scatterplots between OTU ranking metrics for each of the statistical methods.

The (A) Moen, (B) Lai, (C) Le Sciellour, (D) Wang and (E) Birkeland data sets. (PDF)

Explained variance for simulated data and the number of simulations where the simulated effect was detected.

Differentially abundant OTUs are shown in red and non-differentially abundant OTUs in black. (PDF)

An overview of the experimental studies and their data characteristics.

(XLSX)

Number and per cent of significant OTUs detected by each method.

(XLSX)

Summary table of the results obtained for the OTU level across five experimental data sets for the methods SIMPER, ASCA, FFMANOVA, ANCOM, ALDEx2 and DESeq2.

(XLSX)

Filtered relative abundance data and metadata for five experimental data sets.

(XLSX)

RData file with raw counts, relative abundance and clr-transformed and filtered data for the Moen data set.

(RDATA)

R script for the Moen data set analyses.

(R)

R script for the simulated data analyses.

(R) Click here for additional data file. (DS_STORE) Click here for additional data file. (RHISTORY) Click here for additional data file. 19 Apr 2021 PONE-D-21-07978 Analysing microbiome intervention design studies: Comparison of alternative multivariate statistical methods PLOS ONE Dear Dr. Khomich, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by May 30 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Lingling An Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf Please include captions for your Supporting Information files *at the end of your manuscript*, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? 
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Partly Reviewer #3: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: No Reviewer #3: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: In this article, the authors conducted a detailed comparison between different statistical methods for microbiome data analysis. The comparison is based on some dietary intervention studies. I enjoy reading this paper and think such a comparison is needed. But there is still a large room to improve in numerical comparison and writing. See details in attachment. Reviewer #2: In Khomich et al., the authors compared a variety of distance-based and OTU-level microbiome testing methods, on five human- and animal-based dietary studies. The authors found that, on these datasets, the distance-based methods largely agree regarding both effect size and statistical significance, whereas large discrepancies exist for OTU-level testing methods. I found the authors investigation here helpful, especially with the focus on dietary intervention studies. The empirical evaluation of the methods’ performance will provide helpful guidance for researchers in the field. However, I do find the evaluation efforts limited, in order to make rigorous claims comparing across methods. I think additional simulation and real dataset based analysis will significantly boost the paper’s usefulness to interested readers. My comments are detailed below. Major 1. Most methods evaluation efforts, including those published for microbiome data, include simulation sections. The authors correctly pointed out that simulation evaluation has its own caveats, such as the data simulation model favoring some evaluated methods over others. 
However, one of the most important benefits of simulated datasets is that the ground truth (associated OTUs, effect sizes, etc.) are known, which is impossible for real-world datasets. This importantly enables rigorous evaluation of statistical performance, such as power and false positive rates, as consistent metrics across methods. In order to avoid favoring one of the evaluated methods by using a similar simulation model, nonparametric methods such as shuffling the data or combining different environments can be considered. See e.g. PMID 24699258, or the Thorsen et al. paper the authors cited in the work. I would strongly advise the addition of such efforts, as they can provide valuable insights into actual performance of between methods, as the current claims are limited to e.g. “A and B methods are more similar than C method”. The latter is still valuable information, but it does not necessarily inform the reader as to if A/B are preferable to C. 2. Relatedly, there is space for improvement in the real-world data based evaluations. I think one of the strong suits of the authors’ efforts is the inclusion of a variety of dietary intervention studies, with different study designs and host environments. However, the authors claims are limited to which methods performed more similarly in which datasets, using also limited evaluation metrics (p-value and effect size for dissimilarity based analysis, feature rank Spearman correlation for per-OTU analysis). Are the reported percentage variability explained largely consistent with previous literature? Are the important features identified by each method actually biologically plausible? Answering these questions can provide important insight on the validity of each method. Moderate: 1. I’m not sure if the comparison for Figure 2 was carried out entirely correctly. That is, I wonder if the authors might have rankings that are not comparable between studies. E.g. line 304-306: “Also note that for some methods (FFMANOVA, ALDEx2 and ANCOM), the ranking is related to all levels of the experimental factors, whereas the other methods use pairwise comparisons only.” Pairwise comparison and all-level comparison are essentially different tests, and rankings using their test statistics are not directly comparable. The authors should be careful to compare the same tests across methods. 2. The authors claimed LEfSe can only be compared against other methods in two datasets, as the other datasets report too few significant results. However, results for all features can be reported by LEfSe, by setting the most lenient effect size and p-values thresholds. Focusing on datasets with only significant features can artificially inflate LEfSe’s agreement with other methods, as significant features should be definition have more confidence in their effect sizes. Minor: 1. I wonder if the visualization for Figure 2 can be improved. The color can be arranged for animal-based and human-based studies to be more similar among each other. Alternatively, the authors might instead consider plotting five method-against-method heat maps, one for each study, to make the current Y-axis less dense. Reviewer #3: Khomich et. al. provided a review and method comparisons of existing microbiome analysis tools for analyzing intervention design studies. The paper is well-written. They benchmarked four datasets. The analysis was well-conducted. I have some suggestions below: 1. 
Not very clear to me why intervention design studies were different from there microbiome studies with categorical covariates. 2. I would appreciate the authors to review the statistical assumptions that are adopted in each of these methods. The review of the methods is very descriptive. For example: “Among abundance-based beta diversity indices, Bray-Curtis is a common choice due to its theoretical properties and empirical accuracy. Widely applied phylogenetic beta diversity indices is UniFrac-type metrics. However, UniFrac is unsuitable as a distance metric for studies with 84 small sample size, which is usually the case for dietary intervention trials21,22.” What does theoretical properties mean? Statistical properties? What does “Empirical accuracy” refer to? Since this is a rather method paper, author should at least explain the method framework and how the advantages and disadvantages related to different methods. 3. There are many places that authors commented on correlation between taxa, but they did not discuss about in their four real datasets, how correlated the taxa are? Did they on purposely choose the ones contain different level of correlation to distinguish results? 4. Bioinformatic pipeline and data quality that used to pre-processing can make differences for the downstream analysis for microbiome studies. Any suggestions on statistical tools which can be better/worse in these situations? ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. Submitted filename: Comments.pdf Click here for additional data file. 14 Sep 2021 INTRO Reviewer #1: In this article, the authors conducted a detailed comparison between different statistical methods for microbiome data analysis. The comparison is based on some dietary intervention studies. I enjoy reading this paper and think such a comparison is needed. But there is still a large room to improve in numerical comparison and writing. See details in attachment. The comments from the attached .pdf-file are copied and answered below. Thank you for your interest in our work, comprehensive and constructive comments that have helped us to improve the current version of the manuscript. 
A detailed response to all comments is provided below.

Several methods suggesting the same conclusion doesn’t imply that they are reliable. All methods may be wrong in the same way! It might be better to benchmark the analysis with some special experiment design or simulation study on real data.

We have conducted a simulation study imitating four different scenarios to investigate how the tested statistical methods perform in situations with varying effect sizes and different numbers of differentially abundant OTUs. The simulated data were based on data set 1 (Moen et al., 2016), using the same study design and OTU counts as a starting point. At the community level, the methods strongly agreed and there was very low variability between simulations in the estimated explained variance. At the OTU level, ASCA and FFMANOVA had the best performance in terms of True Positives and False Negatives. We have added the description and outcomes of the simulations throughout the revised manuscript. We have also made new figures visualising the outcomes of the simulations.

In the OTU-level comparison, several other popular differential abundance tests should also be included in the comparison. For example, DESeq2 (Love et al., 2014), MetagenomeSeq (Paulson et al., 2013), ANCOM-BC (Lin and Peddada, 2020), DR (Morton et al., 2019).

We have carefully checked the available literature on differential abundance (DA) methods and selected a subset for our study. The main novelty of our work lies in comparing the less known and more general methods FFMANOVA and ASCA to more established methods designed for microbiome data. Many of the available methods, such as ANCOM (Mandal et al., 2015), DESeq2 (Love et al., 2014), edgeR (Robinson et al., 2010) and MetagenomeSeq (Paulson et al., 2013), have been extensively compared (Weiss et al., 2017; Lee et al., 2017). Both studies highlighted ANCOM as the best method regarding sensitivity and false discovery rate (FDR). Therefore, in our study we included ANCOM as the best performing method. Our first submission was done before ANCOM-BC (Lin and Peddada, 2020) was published. That paper shows that ANCOM and ANCOM-BC have comparable FDR and power, and we therefore think that our results on ANCOM are also representative for ANCOM-BC. Also, the scope of our paper is limited to methods suited for analysing multifactor experimental designs; MetagenomeSeq (Paulson et al., 2013) and DR (Morton et al., 2019) were developed for one-factor experiments. Because of this, we have also chosen to exclude LEfSe from our revised manuscript. Based on the study by Love et al. (2014), DESeq2 and edgeR use a similar modelling strategy (negative binomial) and perform similarly in simulations, but differ in normalisation, outlier handling and other adjustable parameters. Thus, we decided to include only one of the methods – DESeq2 – because the differences between DESeq2 and edgeR are at a different conceptual level than the differences between, for instance, ASCA and these methods.

In distance-based methods, the choice of distance is very important. It would be great to compare different choices of distance.

We know that the choice of distance is very important in microbiome surveys. However, this aspect of analysis was beyond the scope of this study.
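For readers who want to see what such a comparison of distance choices looks like in practice, a minimal sketch with the vegan R package follows. The toy count matrix and object names are invented for illustration, and protest() is vegan's permutation-based Procrustes test; this is a generic sketch, not the analysis behind the published comparison described in the next paragraph.

library(vegan)

# Toy OTU count matrix (20 samples x 50 OTUs), invented purely for illustration
set.seed(1)
otu <- matrix(rpois(20 * 50, lambda = 5), nrow = 20)

# Dissimilarity matrices for two candidate metrics
d_bray    <- vegdist(otu, method = "bray")
d_jaccard <- vegdist(otu, method = "jaccard")

# Two-dimensional NMDS ordination of each dissimilarity matrix
nmds_bray    <- metaMDS(d_bray, k = 2, trace = FALSE)
nmds_jaccard <- metaMDS(d_jaccard, k = 2, trace = FALSE)

# Permutation-based Procrustes correlation between the two ordinations (999 permutations)
protest(nmds_bray, nmds_jaccard, permutations = 999)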
We chose the Bray-Curtis distance because it is the most used beta-diversity metric, and the first author of the manuscript has previously published a study in which the performance of four beta-diversity metrics, i.e., Bray-Curtis, Jaccard, Gower and Raup-Crick (as implemented by the “bray”, “jaccard”, “gower” and “raup” options of the vegdist function in the R package vegan), was compared on a set of microbiome data in eight different variations using non-metric multidimensional scaling (NMDS) ordination with two dimensions (Supplementary Figure S4; Khomich et al., 2017: https://doi.org/10.1016/j.funeco.2017.01.008). Assessment of the metrics’ validity was done by Procrustes correlation tests run with 999 permutations (function procrustes in the R package vegan). The Jaccard, Raup-Crick and Bray-Curtis dissimilarity indices produced very similar results, with Gower being the least robust metric.

In abundance-based methods, phylogenetic tree information can also be incorporated (Morton et al., 2017; Wang et al., 2020). These should be cited.

Thank you for your suggestion. We cited these papers in our revised manuscript (see lines 67-68, second paragraph of the Introduction).

In most abundance-based methods, zeros are usually replaced by some small positive constant. This is a very important problem in microbiome data analysis. Currently, the authors only consider one imputation method. It would be better to include a separate experiment to compare different ways to handle zeros. Some studies (Costea et al., 2014) suggest that handling zeros can affect the analysis a lot. Some recent statistical methods designed for the zero problem (Brill et al., 2019; Wang et al., 2021) should at least be cited.

Zero imputation is an ongoing and yet unsolved statistical problem in microbiome research, and newer statistical methods are constantly being developed and applied to both simulated and experimental data sets. We agree with the Reviewer that zero imputation can impact the results and therefore should be considered, but it was beyond the scope of this study. We cited the suggested literature in our revised version of the manuscript (see lines 470-473).

Reviewer #2: In Khomich et al., the authors compared a variety of distance-based and OTU-level microbiome testing methods on five human- and animal-based dietary studies. The authors found that, on these datasets, the distance-based methods largely agree regarding both effect size and statistical significance, whereas large discrepancies exist for OTU-level testing methods. I found the authors’ investigation here helpful, especially with the focus on dietary intervention studies. The empirical evaluation of the methods’ performance will provide helpful guidance for researchers in the field. However, I do find the evaluation efforts too limited to make rigorous claims comparing across methods. I think additional simulation and real dataset-based analysis will significantly boost the paper’s usefulness to interested readers.

We agree with the Reviewer and have conducted simulation studies imitating four different scenarios. The simulated data were based on data set 1 (Moen et al., 2016), using the same study design and OTU counts as a starting point. At the community level, the methods strongly agree and there is very low variability between simulations in the estimated explained variance. At the OTU level, ASCA and FFMANOVA had the best performance in terms of True Positives and False Negatives.
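As a point of reference for these metrics (not part of the authors' analysis): once the simulation fixes which OTUs are truly differentially abundant, the True Positive Rate and True Negative Rate of any method reduce to simple proportions over its significance calls. A minimal sketch in R, where truth and called are hypothetical stand-ins for the simulated ground truth and one method's significant OTUs:

# Ground truth from a simulation: TRUE = OTU was made differentially abundant (hypothetical)
truth  <- c(rep(TRUE, 10), rep(FALSE, 90))
# Significance calls from one method on the same 100 OTUs (hypothetical)
called <- c(rep(TRUE, 7), rep(FALSE, 3), rep(TRUE, 5), rep(FALSE, 85))

tpr <- sum(called & truth)   / sum(truth)    # True Positive Rate (sensitivity)
tnr <- sum(!called & !truth) / sum(!truth)   # True Negative Rate (specificity)
c(TPR = tpr, TNR = tnr)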
We added the description and outcomes of the simulations throughout the revised version of the manuscript. We have also made new figures depicting the simulation results.

MAJOR

1. Most method evaluation efforts, including those published for microbiome data, include simulation sections. The authors correctly pointed out that simulation evaluation has its own caveats, such as the data simulation model favoring some evaluated methods over others. However, one of the most important benefits of simulated datasets is that the ground truth (associated OTUs, effect sizes, etc.) is known, which is impossible for real-world datasets. This importantly enables rigorous evaluation of statistical performance, such as power and false positive rates, as consistent metrics across methods. To avoid favoring one of the evaluated methods by using a similar simulation model, nonparametric methods such as shuffling the data or combining different environments can be considered. See, e.g., PMID 24699258, or the Thorsen et al. paper the authors cited in the work. I would strongly advise the addition of such efforts, as they can provide valuable insights into the actual performance differences between methods, as the current claims are limited to e.g. “A and B methods are more similar than C method”. The latter is still valuable information, but it does not necessarily inform the reader as to whether A/B are preferable to C.

We agree with the Reviewer and have conducted simulation studies imitating four different scenarios. The simulated data were based on data set 1 (Moen et al., 2016), using the same study design and OTU counts as a starting point. At the community level, the methods strongly agree and there is very low variability between simulations in the estimated explained variance. At the OTU level, ASCA and FFMANOVA had the best performance in terms of True Positives and False Negatives. We added the description and outcomes of the simulations throughout the revised version of the manuscript. We have also made new figures depicting the simulation results.

2. Relatedly, there is space for improvement in the real-world data-based evaluations. I think one of the strong suits of the authors’ efforts is the inclusion of a variety of dietary intervention studies, with different study designs and host environments. However, the authors’ claims are limited to which methods performed more similarly in which datasets, also using limited evaluation metrics (p-value and effect size for dissimilarity-based analysis, feature rank Spearman correlation for per-OTU analysis). Is the reported percentage of variability explained largely consistent with previous literature? Are the important features identified by each method actually biologically plausible? Answering these questions can provide important insight into the validity of each method.

In our simulation study, the reported explained variance was very consistent across 100 simulations and strongly agreed between the tested methods. Please see the following examples in the literature: (1) Vandeputte et al. (2017): http://dx.doi.org/10.1136/gutjnl-2016-313271 and (2) Måge et al. (2018): https://www.biorxiv.org/content/10.1101/363630v1.

MODERATE

1. I’m not sure if the comparison for Figure 2 was carried out entirely correctly. That is, I wonder if the authors might have rankings that are not comparable between studies.
E.g., lines 304-306: “Also note that for some methods (FFMANOVA, ALDEx2 and ANCOM), the ranking is related to all levels of the experimental factors, whereas the other methods use pairwise comparisons only.” Pairwise comparison and all-level comparison are essentially different tests, and rankings using their test statistics are not directly comparable. The authors should be careful to compare the same tests across methods.

We agree that pairwise and all-level comparisons are not directly comparable. If one chooses to use a method that produces all-level comparisons, one will use them to interpret the results, and, vice versa, if one uses a method that only outputs pairwise comparisons, those will be used in the interpretation of the results. Our simulation study was designed with only one factor level different from all others; in this case, the all-level and pairwise comparisons are comparable. We rewrote the paragraph in the revised manuscript (see lines 166-178), and it now reads as follows: “The aim of the method comparison was to investigate how different strategies for statistical modelling affect biological inference. At the community level, methods were compared with respect to effect sizes (expressed as percentage of explained variance) and corresponding p-values. At the OTU level, comparison of methods is complex because some methods provide results for an omnibus test of differences between factor levels (FFMANOVA and ANCOM), whereas the other methods provide ranking for specific pairwise comparisons (ASCA and PLS-DA, SIMPER) or contrasts/model coefficients (ALDEx2 and DESeq2). Even so, a biologist will make inferences based on the output provided by the chosen method, and in this context, it is relevant to compare the ranking statistics although the tests are not the same. In our study, the ranking of OTUs was compared by Spearman’s rank correlation and by investigation of scatterplots between the different ranking metrics. For the simulated data, where we know which OTUs are differentially abundant, True Positive Rate (TPR) and True Negative Rate (TNR) were also evaluated.”

2. The authors claimed LEfSe can only be compared against other methods in two datasets, as the other datasets report too few significant results. However, results for all features can be reported by LEfSe, by setting the most lenient effect size and p-value thresholds. Focusing on datasets with only significant features can artificially inflate LEfSe’s agreement with other methods, as significant features should by definition have more confidence in their effect sizes.

LEfSe is a stepwise approach that combines univariate analysis with multivariate discriminant analysis. Even though the method has found wide application in microbiome research, it is not adapted to experimental designs with several multilevel factors and is therefore omitted from the revised version of the manuscript.

MINOR

1. I wonder if the visualization for Figure 2 can be improved. The color can be arranged for animal-based and human-based studies to be more similar among each other. Alternatively, the authors might instead consider plotting five method-against-method heat maps, one for each study, to make the current Y-axis less dense.

Figure 4 (in the revised manuscript) is a two-panel figure: (A) simulated data and (B) experimental data. We implemented the Reviewer’s suggestion: the colour is now arranged so that animal- and human-based studies are more similar to each other (brown vs. green, respectively; see Figure 4B).
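For context, the agreement values summarised in that figure are, as described in the quoted paragraph above, Spearman rank correlations between the per-OTU ranking statistics of two methods, inspected alongside scatterplots. A minimal sketch with invented data, where stat_method_a and stat_method_b are hypothetical per-OTU statistics from two methods:

set.seed(2)
# Hypothetical per-OTU ranking statistics from two methods, for the same 100 OTUs
stat_method_a <- runif(100)
stat_method_b <- stat_method_a + rnorm(100, sd = 0.2)   # a partly agreeing second method

# Spearman rank correlation between the two rankings
cor(stat_method_a, stat_method_b, method = "spearman")

# Scatterplot of the two ranking metrics, as used for visual comparison
plot(rank(stat_method_a), rank(stat_method_b),
     xlab = "OTU rank, method A", ylab = "OTU rank, method B")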
The method pairs (X-axis) are sorted from low to high correlation between the methods (Y-axis). We have also plotted this figure as a heatmap, but the visualisation was less user-friendly, so we decided to keep the original dot plot with the modifications suggested by the Reviewer.

Reviewer #3: Khomich et al. provided a review and comparison of existing microbiome analysis tools for analyzing intervention design studies. The paper is well-written. They benchmarked five datasets. The analysis was well-conducted. I have some suggestions below:

1. Not very clear to me why intervention design studies were different from other microbiome studies with categorical covariates.

In contrast to observational studies, dietary intervention trials are usually small in sample size but performed in (semi)-controlled environments and tailored to a specific research hypothesis. The studies often include multiple experimental factors, possibly with more than two levels. This information can be found in the revised manuscript (lines 69-72).

2. I would appreciate the authors reviewing the statistical assumptions adopted in each of these methods. The review of the methods is very descriptive. For example: “Among abundance-based beta diversity indices, Bray-Curtis is a common choice due to its theoretical properties and empirical accuracy. Widely applied phylogenetic beta diversity indices is UniFrac-type metrics. However, UniFrac is unsuitable as a distance metric for studies with small sample size, which is usually the case for dietary intervention trials.” What does “theoretical properties” mean? Statistical properties? What does “empirical accuracy” refer to? Since this is rather a methods paper, the authors should at least explain the method framework and how the advantages and disadvantages relate to the different methods.

Initially we did consider including more details about each method but decided against it to keep the manuscript concise. We chose instead to refer to the original publications for further details about statistical assumptions. The sentence about the beta diversity metrics was rephrased and now reads as follows (lines 87-91): “Among abundance-based beta diversity indices, Bray-Curtis is the most common choice for count data. The most widely applied phylogenetic beta diversity indices are UniFrac-type metrics. However, UniFrac is unsuitable as a distance metric for studies with a small sample size, which is usually the case for dietary intervention trials”.

3. There are many places where the authors commented on correlation between taxa, but they did not discuss how correlated the taxa are in their five real datasets. Did they purposely choose ones containing different levels of correlation to distinguish the results?

We are not sure what the Reviewer meant here. We argue that multivariate methods should be applied to microbiome data because different OTUs (and the corresponding microbial taxa) will be correlated. We did not choose specific taxa based on correlation structure but used filtering criteria related to abundance to avoid extreme sparsity of the data.

4. The bioinformatic pipeline and data quality criteria used for pre-processing can make a difference for the downstream analysis in microbiome studies. Any suggestions on statistical tools which can be better/worse in these situations?

We agree with the Reviewer that the bioinformatic pipeline and data quality criteria used can affect the downstream analyses, but this was beyond the scope of our paper.
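As a footnote to the benchmarking discussion above (Reviewer #1's first comment and Reviewer #2's first major comment), the label-shuffling idea can be illustrated in a few lines of R: permuting the treatment labels destroys any real diet-microbiome association, so any signal detected on the shuffled data is by construction a false positive. The sketch below uses PERMANOVA (vegan::adonis2) on an invented toy data set purely as an example; the same loop could wrap any of the compared methods.

library(vegan)

# Toy data: 20 samples x 50 OTUs and a two-level diet factor, invented for illustration
set.seed(3)
otu  <- matrix(rpois(20 * 50, lambda = 5), nrow = 20)
diet <- factor(rep(c("control", "intervention"), each = 10))
d    <- vegdist(otu, method = "bray")

# Shuffle the diet labels repeatedly; on shuffled data every "significant" result is a false positive
null_p <- replicate(200, {
  meta <- data.frame(shuffled = sample(diet))
  # adonis2 also reports R2, the explained variance used as the community-level effect size
  adonis2(d ~ shuffled, data = meta, permutations = 199)$`Pr(>F)`[1]
})

mean(null_p < 0.05)  # empirical false positive rate at the 5% level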
Submitted filename: Response-to-Reviewers_FINAL-14.09.2021.docx

2 Nov 2021

Analysing microbiome intervention design studies: Comparison of alternative multivariate statistical methods
PONE-D-21-07978R1

Dear Dr. Khomich,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Lingling An
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data, e.g. participant privacy or use of data from a third party, those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

**********
5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors carefully revised the manuscript and addressed all of my points. I recommend its publication.

Reviewer #2: I'm satisfied with the authors' revisions. A few minor comments on the figures:

1. The legend for Fig 2 could be more detailed; it took me a while to understand that the numbers indicate the percentage of simulated datasets where the p-value was significant. Also, the lighter colors for the "Few_Low" and "Many_Low" scenarios made them a bit difficult to read. I understand they were designed to be consistent with the Fig 3 and 4 colorings. Maybe the authors could make them just a bit darker.

2. Figure 5: I would guess the red points in panels A and C indicate features that are differentially abundant through simulation. If so, the figure legend should clearly indicate this.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

8 Nov 2021

PONE-D-21-07978R1
Analysing microbiome intervention design studies: Comparison of alternative multivariate statistical methods

Dear Dr. Khomich:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Lingling An
Academic Editor
PLOS ONE
References (54 in total; first 10 shown)

1.  A fair comparison.

Authors:  Paul I Costea; Georg Zeller; Shinichi Sunagawa; Peer Bork
Journal:  Nat Methods       Date:  2014-04       Impact factor: 28.547

2.  Resources and tools for the high-throughput, multi-omic study of intestinal microbiota.

Authors:  Aitor Blanco-Míguez; Florentino Fdez-Riverola; Borja Sánchez; Anália Lourenço
Journal:  Brief Bioinform       Date:  2019-05-21       Impact factor: 11.622

3.  Daily Sampling Reveals Personalized Diet-Microbiome Associations in Humans.

Authors:  Abigail J Johnson; Pajau Vangay; Gabriel A Al-Ghalith; Benjamin M Hillmann; Tonya L Ward; Robin R Shields-Cutler; Austin D Kim; Anna Konstantinovna Shmagel; Arzang N Syed; Jens Walter; Ravi Menon; Katie Koecher; Dan Knights
Journal:  Cell Host Microbe       Date:  2019-06-12       Impact factor: 21.023

4.  Identifying and Overcoming Threats to Reproducibility, Replicability, Robustness, and Generalizability in Microbiome Research.

Authors:  Patrick D Schloss
Journal:  mBio       Date:  2018-06-05       Impact factor: 7.867

5. (Review)  Microbiome Datasets Are Compositional: And This Is Not Optional.

Authors:  Gregory B Gloor; Jean M Macklaim; Vera Pawlowsky-Glahn; Juan J Egozcue
Journal:  Front Microbiol       Date:  2017-11-15       Impact factor: 5.640

6.  Comparative Microbiome Signatures and Short-Chain Fatty Acids in Mouse, Rat, Non-human Primate, and Human Feces.

Authors:  Ravinder Nagpal; Shaohua Wang; Leah C Solberg Woods; Osborne Seshie; Stephanie T Chung; Carol A Shively; Thomas C Register; Suzanne Craft; Donald A McClain; Hariom Yadav
Journal:  Front Microbiol       Date:  2018-11-30       Impact factor: 5.640

7.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors:  Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal:  Bioinformatics       Date:  2009-11-11       Impact factor: 6.937

8.  MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities.

Authors:  Kim-Anh Lê Cao; Mary-Ellen Costello; Vanessa Anne Lakis; François Bartolo; Xin-Yi Chua; Rémi Brazeilles; Pascale Rondeau
Journal:  PLoS One       Date:  2016-08-11       Impact factor: 3.240

9.  PERMANOVA-S: association test for microbial community composition that accommodates confounders and multiple distances.

Authors:  Zheng-Zheng Tang; Guanhua Chen; Alexander V Alekseyenko
Journal:  Bioinformatics       Date:  2016-05-19       Impact factor: 6.937

10.  Data and Statistical Methods To Analyze the Human Microbiome.

Authors:  Levi Waldron
Journal:  mSystems       Date:  2018-03-13       Impact factor: 6.496

Related articles (2 in total)

1.  A Three-Day Intervention With Granola Containing Cereal Beta-Glucan Improves Glycemic Response and Changes the Gut Microbiota in Healthy Individuals: A Crossover Study.

Authors:  Vibeke H Telle-Hansen; Line Gaundal; Benedicte Høgvard; Stine M Ulven; Kirsten B Holven; Marte G Byfuglien; Ingrid Måge; Svein Halvor Knutsen; Simon Ballance; Anne Rieder; Ida Rud; Mari C W Myhrstad
Journal:  Front Nutr       Date:  2022-04-28

2.  Investigating differential abundance methods in microbiome data: A benchmark study.

Authors:  Marco Cappellato; Giacomo Baruzzo; Barbara Di Camillo
Journal:  PLoS Comput Biol       Date:  2022-09-08       Impact factor: 4.779

