Jennifer Tom1, Joseph Nathaniel Paulson1, Justin Williams1,2, Hector Corrada Bravo3. 1. Department of Biostatistics, Genentech, Inc, South San Francisco, CA, 94080, USA. 2. Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA, 90095, USA. 3. Department of Computer Science, University of Maryland, College Park, College Park, MD, 24072, USA.
Abstract
An increasing emphasis on understanding the dynamics of microbial communities in various settings has led to the proliferation of longitudinal metagenomic sampling studies. Data from whole metagenomic shotgun sequencing and marker-gene survey studies have characteristics that drive novel statistical methodological development for estimating time intervals of differential abundance. In designing a study and the frequency of collection prior to a study, one may wish to model the ability to detect an effect, e.g., there may be issues with respect to cost, ease of access, etc. Additionally, while every study is unique, it is possible that in certain scenarios one statistical framework may be more appropriate than another. Here, we present a simulation paradigm implemented in the R Bioconductor software package microbiomeDASim available at http://bioconductor.org/packages/microbiomeDASim microbiomeDASim. microbiomeDASim allows investigators to simulate longitudinal differential abundant microbiome features with a variety of known functional forms with flexible parameters to control desired signal-to-noise ratio. We present metrics of success results on one particular method called metaSplines. Copyright:
An increasing emphasis on understanding the dynamics of microbial communities in various settings has led to the proliferation of longitudinal metagenomic sampling studies. Data from whole metagenomic shotgun sequencing and marker-gene survey studies have characteristics that drive novel statistical methodological development for estimating time intervals of differential abundance. In designing a study and the frequency of collection prior to a study, one may wish to model the ability to detect an effect, e.g., there may be issues with respect to cost, ease of access, etc. Additionally, while every study is unique, it is possible that in certain scenarios one statistical framework may be more appropriate than another. Here, we present a simulation paradigm implemented in the R Bioconductor software package microbiomeDASim available at http://bioconductor.org/packages/microbiomeDASim microbiomeDASim. microbiomeDASim allows investigators to simulate longitudinal differential abundant microbiome features with a variety of known functional forms with flexible parameters to control desired signal-to-noise ratio. We present metrics of success results on one particular method called metaSplines. Copyright:
Analysis of the microbiome aims to characterize the composition and functional potential of microbes in a particular ecosystem. Recent studies have shown the gut microbiome plays an important role in various diseases, from the efficacy of cancer immunotherapy to the pathogenesis of inflammatory bowel disease (IBD)
[1–
4]. While many studies profile static community “snapshots”, microbial communities do not exist within an equilibrium
[5]. To better understand bacterial population dynamics, many studies are expanding to longitudinal sampling and foregoing cross-sectional or single time-point explorations. With a decrease in sequencing costs, more longitudinal data will be generated for varying communities of interest. While data generation will present fewer difficulties, there remain several statistical challenges involved in analyzing these datasets.The common approach in the marker-gene survey literature is to perform pairwise differential abundance tests between specific time points and visually confirm, sometimes using smoothing methods like splines, how differences are manifested across time
[6]. These methods require that analysts provide one or more specific time points to test, and the statistical inferences derived from these procedures are specific to these pairwise tests. Other standard methods for longitudinal analysis test for global differences across time, sometimes using non-linear methods including splines to capture dynamic profiles across time
[7]. Incorporating confounding sources of variability, both biological and technical is essential in high-throughput studies
[8] and require statistical methods capable of estimating both smooth functions and sample-specific characteristics.Simulating marker-gene amplicon sequencing data presents a variety of challenges related to biological and technical limitations when collecting data. We present a framework for simulating data that can be used across multiple methods for estimating longitudinal differential abundance. This simulation framework allows for appropriate comparison between methods while taking into account some of the unique challenges for the marker-gene amplicon sequencing data, including the following:Non-negative restrictionPresence of Missing Data/High Number of Zero ReadsLow Number of Repeated MeasurementsAsynchronous Repeated MeasuresSmall Number of SubjectsThe first two challenges described above are related to the data generating process itself while the following three represent logistical challenges often faced when collecting the data. In
microbiomeDASim
[9], we attempt to address these data generating challenges through specific simulation mechanisms described in the
Microbiome adaptions section. Similarly, logistical challenges are addressed by allowing users to specify these values flexibly and investigate the corresponding effects, tailoring the simulation to an appropriate setting.This package allows investigators to simulate longitudinal differentially abundant microbiome features with a variety of known functional forms along with flexible parameters to control design aspects such as signal to noise ratio, correlation structure, and effect size. This feature simulation paradigm can be used in study design evaluation by either matching previously observed trends from small scale studies or evaluating the power to detect differential abundance with a specified study duration, sample size, effect size, effect shape and sample collection schedule. We highlight the ability of the package to use results from a historical longitudinal study on the humangut microbiome in gnotobiotic mice
[10] to simulate differential abundance for a hypothetical large scale expansion of this study and then demonstrate using the simulation package to evaluate the performance of one particular method of differential abundance estimation across a range of parameter values, metaSplines
[11].
Methods
Distributional assumptions
Sequencing data are often non-normal. However, transformations, such as log(·) or arcsinh(·), are often applied to raw marker-gene amplicon sequencing data so that the subsequent data is approximately normally distributed. As such, we generate simulated data from a multivariate normal distribution. Using a multivariate normal is a natural choice in this setting as longitudinal correlation structure can be easily incorporated. The following methods focus on cases where the desired microbiome features following appropriate transformation are approximately normally distributed.Assume that we have data generated from the following distribution,wherewith
Y
representing the
i
individual at the
j
time point and each individual has
q
repeated measurements with
i ∈ {1, … ,
n} and
j ∈ {1, … ,
q
}. We define the total number of observations as
While this model holds for different choices of
q
, throughout this article we will assume, without loss of generality, that the number of repeated measurements is constant, i.e.,
q
=
q ∀ i ∈ {1, … ,
n}. This means that the total number of observations simplifies to the expression
N =
qn. Similarly, we split the total patients (
n) into two groups, control (
n
0) and treatment (
n
1), with the first
n
0 patients representing the control patients and the remaining
n–n
0 representing the treatment patients. Subsequently we define the total number of observations in each group as
N
0 =
n
0 ·
q and
N
1 =
n
1 ·
q respectively.
Y represents a single taxa/feature to be simulated across the
N samples. When simulating multiple features as shown later in the
gen_norm_microbiome, these features are assumed to be independent.
Mean components
Partitioning our observations into control and treatment groups in this way allows us to define the mean vector separately for each group as
= (
0,
1) where
0 is an
N
0 × 1 vector and
1 is an
N
1 × 1 vector. To generate differential abundance the mean for the control group is held constant
µ
0
1
, but allow the mean vector for the treatment group to vary as a function of time
µ
1
(
t) =
µ
0 +
f(
t
) for
i = 1, … ,
n
1 and
j = 1, … ,
q. The form of
f(
t
) will dictate the functional form of the differential abundance. Note that if
f(
t
1) = 0, then both groups have equal mean at baseline.
Polynomial functional forms
We allow
f(
t
) to be specified using polynomial basis asfor a
p dimensional polynomial. We restrict the allowed polynomials to be either linear,
p=1, quadratic,
p = 2, or cubic,
p = 3. For instance, to define a quadratic polynomial one would specify
= (
β
0,
β
1,
β
2)
in the following equation,Again, it is important to note that if
=
0, that the treatment group is assumed to have no differentially abundant timepoints. Typically to simulate no differential abundance, a linear trend is chosen with
β
0 =
β
1 = 0.
Oscillating functional forms
While polynomial functions are often natural choices for longitudinal trends, interest also lies in exploring other non-smooth, i.e., non-differentiable, types of trends. One such form we refer to as oscillating functional forms. These trends include types that transition from linearly increasing to linear decreasing at a point, or vice versa from linearly decreasing to linear increasing. One of the most well known trends of this type is the absolute value function. To allow for flexible choices in oscillating type trends, we allow for these non differentiable linearly connected trends to repeat forming what we call M and W trends. From a biological perspective we could think of these trends as representing spikes in a particular feature that may occur immediately after a treatment dose is given, but then decays rapidly to baseline levels followed by a similar spike and decay upon repeated dosing. These functional trends are operationalized aswhere IP
for
k = 1, 2, 3 denotes an inflection point where the linear trend changes from increasing to decreasing or vice versa. Note that for these types of trends that the sign of
β
1 determines whether the trend is initially increasing, i.e. M, (
β
1 > 0) or initially decreasing, i.e. W, (
β
1 < 0). By construction, we force the trend line to be exactly zero at IP
2 and by doing so the trend is specified completely as
= (
β
0,
β
1)
and
IP = (IP
1, IP
2, IP
3)
. An implicit restriction on the functional trend is that IP
3 ≠
t
. However, we can construct absolute value and inverted absolute value type trends by defining IP
1 ∈ (
t
1,
t
) and IP
2, IP
3 >
t
. Again, the key difference for these set of trends is that the inflection points create non-smooth trends.
Hockey stick functional forms
An additional extension to linear functional trends is the family of Hockey Stick functional forms. There are two available families of hockey stick functional forms, which are referred to as L_up and L_down within the package. Both of these trends are designed to create two mutually exclusive regions over the time frame specified. These two regions are defined as
ℜ
1 = (
t
1, IP) and
ℜ
2 = (IP,
t
) where one of the regions
ℜ
1 or
ℜ
2 has linear differential abundance while the other has no differential abundance and IP denotes the inflection point. In the case of the L_up trend,
ℜ
1 is defined as the non-differentially abundant region and
ℜ
2 is a linearly increasing region. We can define the functional form asNote that with this specification that we do not specify the intercept
β
0 and instead only need to specify the slope term
β
1 and the appropriate point of change. We restrict the slope term to be positive, i.e.,
β
1 ∈ (0, ∞) to create the "up" trend.Conversely, the L_down trend assumes that
ℜ
1 is a differentially abundant region that begins with the treatment group higher than the control group and then linearly decreases to the region
ℜ
2 where there is no differential abundance. We define this functional form asNote that in this case we do not specify the point of change directly, but rather it is implicitly implied by the choice of
β
0 and
β
1, i.e. IP = –
β
0/
β
1. To ensure that the trend in
ℜ
1 is properly specified, we place additional restrictions on the parameters so that
β
0 ∈ (0, ∞) and
β
1 ∈ (–∞, 0) to ensure the trend is decreasing and check that the choice of
β
0 and
β
1 are appropriately defined so that IP ∈ (
t
1,
t
).Example trends are shown in
Figure 1 generated using the
mean_trend function.
Figure 1.
Different functional forms available using the
mean_trend() function.
Covariance components
As discussed in the
Introduction, the multivariate normal is a natural choice for longitudinal simulation due to the ease with which dependency of repeated measures is specified. To encode this longitudinal dependency observations within an individual are assumed to be correlated, i.e. Cor(
Y
,
Y
) ≠ 0 ∀
j ≠
j' and
i ∈ {1, … ,
n}, but observations between individuals are assumed independent, i.e. Cor(
Y
,
Y
) = 0 ∀
i ≠
i' and
j ∈ {1, … ,
q
}. To accomplish this we define the block diagonal matrix
Σ as
Σ = bdiag(
Σ
1, … ,
Σ
), where each
Σ
is a
q ×
q covariance matrix for individual
i and bdiag(·) indicates that the matrix is block diagonal with all off diagonal elements not in
Σ
equal to zero. For each individuals covariance matrix, we assume a global standard deviation parameter and correlation component
ρ, i.e.
Σ
=
σ
2
(
ρ).For instance, if we want to specify an autoregressive correlation structure for individual
i the covariance matrix is defined asIn this case we are using the first order autoregressive definition and therefore will refer to this as AR(1).Alternatively, for the compound correlation structure for an individual
i' we define the covariance matrix asFinally, we allow the user to specify an independent correlation structure for an individual
i'', which assumes that repeated observations are in fact uncorrelated and is defined asEach of these correlation structures are referred as AR(1), compound, and independent respectively.
Microbiome adaptions
As discussed in the
Introduction, simulating microbiome data presents a variety of unique challenges. In particular there are two data generating restrictions, 1. non-negative restriction and 2. presence of missing data/high number of zero reads, that must be addressed when simulating this data. In this section we will outline some of the specific adaptions of the simulation framework designed to address these issues.One of the most relevant challenges faced with microbiome data, is the restriction of the domain to non-negative values. To assure that the simulated normalized counts are non-negative, one solution is to simply replace the multivariate normal distribution with a multivariate
truncated normal distribution. The new data generating distribution is nowwhere TN indicates the multivariate truncated normal distribution and
a is the left-truncation value. To impose zero truncation it is assumed that
a = 0. Values from the multivariate truncated normal are drawn using the package
tmvtnorm
[12]. Note that the default method for drawing observations from this distribution is rejection sampling which proceeds by first drawing from a multivariate normal and then for all values that fall below
a to reject the observed sample and re-sample. This procedure works well when the majority of the distribution falls above the truncation point, but can be computational intensive when the probability of acceptance,
p
acpt =
P(
Y >
a
1
), is low. In our simulation design if the value of
is sufficiently close to
a then rejection sampling is not feasible. In the case there the
p
acpt ≤ 0.1, non-negative restriction is imposed by censoring negative values and using point imputation with the truncation value
a as shown belowTo remove the non-negative restriction there is an option in the function
mvrnorm_sim which can be used to turn-off the domain restriction, but by default the zero truncation is imposed. Note that an alternative option to using the multivariate truncated normal is to use the Johnson translation system which can allow samples to be drawn from a multivariate log normal distribution via an appropriate translation function
[13]. The current implementation uses only the multivariate truncated normal distribution for drawing samples via the
zero_trunc option within the
mvrnorm_sim() and
gen_norm_microbiome() functions.The second major data generating challenge when simulating microbiome data is the presence of missing data along with a high percentage of features with zero counts. Based on technical limitations when amplifying and sequencing microbiome data, certain features may be present but remain undetected. To approximate this potential for missing features that are truly present, options within
mvrnorm_sim allow the user to specify: 1) the percent of individuals to generate missing values from (
missing_pct), 2) the number of measurements per individual to assign as missing (
missing_per_subject), and 3) the value to impute for missing observations (
miss_val). Sample IDs are randomly chosen without replacement across all
n units and for each selected ID measurements are randomly selected without replacement from {
t
2 , … ,
t
} until the specified number of measurements per individual is achieved. For each missing measurement selected the observed value is replaced with the user specified missing value. Typically the missing value is specified as 0 or as
NA with the first case representing a situation where the feature was not included due to technical limitations and the second representing an individual whose data was not collected for a particular time point. The initial value
t
1 cannot be assigned as missing since it is assumed that all individuals have baseline values collected.
Implementation
The current version of the R Bioconductor software package
microbiomeDASim
[9] can be installed in R with the following executable code:Alternatively, a development version is available from GitHub and can be accessed at the following repository
williazo/microbiomeDASim. The developmental version may contain additional features that are being developed before they are officially introduced into the Biocondutor version. The developmental version can be installed using the following code:For a guided introduction into using the functions see either the package vignette for a static example of how to set up and interact with various options for simulating data or for a dynamic guide see
mvrnorm_demo.ipynb, a Jupyter notebook on the GitHub page under the inst/script directory. This notebook can be loaded using Google Collab allowing the code to be run without installing Jupyter locally.
Operation
microbiomeDASim
[9] is compatible with major operating systems including Mac OS, Windows and Linux. Package dependencies and system requirements are outlined in the documentation available at GitHub.
Use cases
Data generating procedure
The primary mechanism for simulating data in the
microbiomeDASim package
[9] is the function
mvrnorm_sim. Through this function, the number of subjects in each group is specified along with the necessary parameters, i.e
,
σ
2,
ρ, and
IP, to generate
and
Σ. Below is an example of generating differential abundance using a quadratic trend. This type of example could be part of an initial attempt to understand the effects of proposed sample sizes per group, hypothetical functional forms for differential abundance, and sensitivity to signal to noise ratios. In this case there may be a sparsity of empirical evidence and many possible simulation designs can be tested, or on the other end of the spectrum the ecological process could be well understood and the parameter values are well known with emphasis focused on constraints such as collection timepoints and sample size.The output of the simulation function is a list with 7 total objects. The main object of interest is
df, which is a data.frame that contains the complete outcome,
Y, IDs for each subject
i = 1, … ,
n, the corresponding time for each observation
t
, a group variable indicator, and the outcome with missing data,
Y_obs. The time interval of interest must be specified as a parameter in
t_interval, and by default timepoints are drawn at equidistant points along this interval. Both the complete and missing data vectors are also returned as independent objects,
Y and
Y_obs, respectively, along with the complete mean,
=
Mu, and covariance matrix,
Σ=
Sigma. The function also includes a data.frame
miss_data which lists any IDs and time points for which missing data was induced. Finally, the function also returns the total number of observations,
N=Σ
. The option dis_plot is used to automatically generate a time-series plot tracking each individuals trajectory along with group mean trajectories. The corresponding plot for this data is shown in
Figure 2a.
Figure 2.
Simulating a quadratic differential abundance trend with compound correlation structure and parameters:
β = (0, 3, − 0.5)
,
ρ = 0.7,
σ = 1,
n
0 =
n
1 = 20,
q = 6.
Missing data in
Figure 2b is generated with 20% of subjects randomly selected to have missing values and for each of these subjects to have 2 non-baseline times randomly selected to be missing with the missing observations imputed as 0.
Simulating a quadratic differential abundance trend with compound correlation structure and parameters:
β = (0, 3, − 0.5)
,
ρ = 0.7,
σ = 1,
n
0 =
n
1 = 20,
q = 6.
Missing data in
Figure 2b is generated with 20% of subjects randomly selected to have missing values and for each of these subjects to have 2 non-baseline times randomly selected to be missing with the missing observations imputed as 0.One important thing to note about the example above is that we generated no missing observations as both missing_pct and missing_per_subject were set to 0. Therefore
miss_data was empty. We can compare this to the case below where we induce missingness into the data.In this case we see that for
t
5 and
t
6 for subject 6 that our outcome with missing data, Y_obs, is now set as 0 which was specified as our missing value while the complete data has the original value before inducing missingness. Another feature demonstrated in this second example is using the
asynch_time option. When this variable is set to true, timepoints are randomly drawn from a uniform distribution over the interval [
t
,
t
]. By construction it is assumed that all individuals have a baseline measurement recorded at t
0, but all remaining timepoints are drawn at random. The corresponding plot of the outcome
Y_obs for this simulation which contains the induced missing observations and asynchronous time measurements is shown in
Figure 2b.As mentioned in the
Distributional assumptions section, data are generally generated one feature at a time. However, we may want to simultaneously create data with similar patterns across a number of features with certain features experiencing differential abundance while others have no differential abundance patterns. To do this we can use the function
gen_norm_microbiome which lets users specify the number of total features to simulate,
features, and the number of total features to be differentially abundant,
diff_abun_features. In the example below 10 total features are generated with 4 features having longitudinal differential abundance with an L_down hockey stick type trend.There are two objects returned in this function,
bug_feat and
Y. The object
bug_feat contains all of the sample specific information including Subject ID, timepoint
t
, an indicator for group assignment and the Sample_ID which ranges from Sample_1 up to Sample_N. The other object
Y is the typical OTU (operational taxonomic unit) table with rows corresponding to features and column to samples that are commonly used for analysis in packages such as
metagenomeSeq
[14,
15] and
phyloseq
[16]. There are two additional helper functions that will convert the simulated data into MRexperiment or phyloseq objects respectively to allow practitioners to use simulated data in either of these familiar environments.
Approximating observed microbiome data
Another important goal of the simulation software is the ability to closely approximate real data from longitudinal experiments where sequencing was performed. To demonstrate this ability using
microbiomeDASim we will approximate observed data from a longitudinal study on the humangut microbiome in gnotobiotic mice
[10]. This data file is available within the
metagenomeSeq package, and is particularly interesting to simulate for several reasons. The experiment was performed with a total of 12 mice, 6 in each treatment arm, to test the effect of a low-fat, plan polysaccharide-rich diet (BK) versus a high-fat, high-sugar (Western) diet. This small scale study showed promising results and may warrant a larger scale clinical design to investigate the robustness of the effect of diet on the gut microbiome. As such, we can use the simulation tools to generate hypothetical results for this large scale trial assuming that we observe either the same functional trend as the original study or any of the possible hypothetical functional trends at our disposal, including no differential abundance. We will show how to generate hypothetical data for a large scale version of this experiment by increasing the sample size by five fold and replicating the observed functional trend for a particular feature of interest.As a first step we need to identify a particular feature of interest at an appropriate taxonomic level. The original data contains sequenced counts on over 10,000 OTUs with the majority of these being extremely low frequency features. Since the total sample size (n=12) is too small for central limit theory approximations to be valid, we aggregate counts to the genus level for modelling. We further filter genus level features by imposing a minimum depth of 1000 and presence of 10, leaving a set of 35 features. Of these 35 features, we select one at random which we will want to replicate using our simulation framework. In our case we select the genus
Sutterella. The raw sequencing counts are then log normalized using the default procedure available in the
metagenomeSeq package which will serve as our primary outcome of interest. We plot these results over time as shown in
Figure 3.
Figure 3.
Observed longitudinal trends for the two diet groups in Turnbaugh
et al.
[10] study for the Genus
Sutterella with estimated LOESS curves for each group.
Note that both groups had equivalent diets over the first 21 days with half of the mice switching to the Western diet at this point marked with the vertical dotted line.
Observed longitudinal trends for the two diet groups in Turnbaugh
et al.
[10] study for the Genus
Sutterella with estimated LOESS curves for each group.
Note that both groups had equivalent diets over the first 21 days with half of the mice switching to the Western diet at this point marked with the vertical dotted line.There is significant variability between the groups with a marked decrease in both groups prior to the implementation of the intervention. We see a bounce back effect to baseline levels occurring in the BK diet group while the Western diet group have significantly lower values across the remainder of the study period. We see that the measurement timepoints for each individual vary slightly and are not equally spaced over the entire study window.As the primary interest lies in the difference between the diet groups across time, we develop our simulation model by re-scaling the BK reference group to a constant level across time and allowing theWestern group to vary. To obtain an initial estimate of this treatment functional trend we use the
metagenomeSeq
[15] package to fit a Gaussian smoothing spline ANOVA (SS-ANOVA) shown in
Figure 4.
Figure 4.
Estimated functional form of the longitudinal differential abundance for the Western diet group from Turnbaugh
et al.
[10] study for the Genus
Sutterella.
The black line represents the point estimate with the dashed red lines corresponding to 95% confidence intervals fit using Gaussian smoothed spline ANOVA.
Estimated functional form of the longitudinal differential abundance for the Western diet group from Turnbaugh
et al.
[10] study for the Genus
Sutterella.
The black line represents the point estimate with the dashed red lines corresponding to 95% confidence intervals fit using Gaussian smoothed spline ANOVA.We see that over the initial 21 days that the 95% confidence intervals for the differential abundance overlaps zero, and that after the intervention begins that the Western group is significantly lower than the BK group. While the estimated trend is non-linear, we may expect that this is a function of small sample size noise and that the true functional trend is a linearly decreasing trend. We therefore construct our hypothetical functional form using the
L_up designation assuming there is no differential abundance over the interval
t ∈ [0, 21] followed by a linearly decreasing trend over the interval
t ∈ (22,80]. We show this chosen functional form alongside the estimated differential abundance in
Figure 5.
Figure 5.
Estimated functional form of the longitudinal differential abundance for the Western diet group from Turnbaugh
et al.
[10] study for the Genus
Sutterella with the corresponding functional form chosen for the simulation shown in blue.
The black line represents the point estimate with the dashed red lines corresponding to 95% confidence intervals fit using Gaussian smoothed spline ANOVA.
Estimated functional form of the longitudinal differential abundance for the Western diet group from Turnbaugh
et al.
[10] study for the Genus
Sutterella with the corresponding functional form chosen for the simulation shown in blue.
The black line represents the point estimate with the dashed red lines corresponding to 95% confidence intervals fit using Gaussian smoothed spline ANOVA.In general our hypothetical trend is contained within the estimated bounds of the smoothed fig, and we may believe that it is an ecologically valid representation of the expected change over time.With
microbiomeDASim we can use the observed times for each ID, and replicate each subject five times creating a total sample size of n=60 with 30 mice in each treatment arm. We use the data to obtain estimates for
sigma and
control_mean along with the functional form chosen above to generate the simulated data using the
mvrnorm_sim_obs() function with an AR1 correlation structure. The results for the simulated data are shown in
Figure 6.
Figure 6.
Estimated functional form of the longitudinal differential abundance for the Western diet group from Turnbaugh
et al.
[10] study for the Genus
Sutterella with the corresponding functional form chosen for the simulation shown in blue.
The black line represents the point estimate with the dashed red lines corresponding to 95% confidence intervals using metaSplines to fit smoothed spline ANOVA.
The black line represents the point estimate with the dashed red lines corresponding to 95% confidence intervals using metaSplines to fit smoothed spline ANOVA.This simulated data could then be used to conduct power analyses of detecting differential abundance at time
t ∈ [
t
0,
t
] or this process could be repeated multiple times to generate feasible bounds for what the trend may look like in this larger sample. Alternatively, the observed data could be altered to change the planned time point measurements to see the effect of collecting fewer samples during the follow-up period. In addition, as mentioned earlier in this section multiple functional forms could be tested including situations where no differential abundance is observed to determine the likelihood of committing Type 1 errors. Further details and code for this example are available on GitHub at
inst/script/mouse_microbiome_approximation.pdf
Longitudinal differential abundance estimation
Next, we want to use our simulation design to test some of the available methods to estimate longitudinal differential abundance. We will examine properties of the estimation method available in the
metagenomeSeq
[15] package to fit a Gaussian smoothing spline ANOVA (SS-ANOVA) model
[11,
17,
18] referred to here after as the metaSplines method. We start by generating our simulated data. In this example we will fix parameters to have
q = 10 repeated measurements on each individual with
n
0 =
n
1 = 30 individuals per arm.After generating the simulated data, we fit the model. Note that one can fit either the outcome with the complete data or the outcome with imputed missing data. In this example we use the complete data. To use the induced missing data when creating the MRexperiment object we would set the missing variable in
simulate2MRexperiment to TRUE.Now we can display the estimated interval of differential abundanceWe compare the estimated trend
to the truth
f(
t
) as shown in
Figure 7. We observe that the metaSplines estimate falls closely to the true functional form. Further, the confidence intervals for the functional form completely contain the true trend reflecting that the variability in estimation is accurately reflected.
Figure 7.
Comparison of the estimated functional form for the metaSplines method, in red, to the truth, in black.
Evaluating estimation procedures
In the example for metaSplines above we looked at performance using a visual inspection for a single choice of parameter values. Using our simulation framework we can expand our investigation of performance. By knowing the true underlying functional form we can quantify how accurate a particular estimation method captures the truth as a function of sample size per group, number of repeated observations, signal-to-noise strength, type of functional form etc. In order to use the simulated data to compare different longitudinal methods for estimating differential abundance we need to define performance metrics that quantify how accurate an estimate is to the truth. We propose four different performance metrics that can be used when comparing methods.Sensitivity/Specificity ∈ [0, 1]Cosine SimilarityEuclidean DistanceNormalized Euclidean DistanceTo ensure robustness, for each set of parameter values simulated multiple repetitions,
B, are required. Sensitivity is defined as the number of repetitions where
any differential abundance at any value
t
∊ {
t
1, . . . ,
t
} is detected over the total number of repetitions given that the functional form had some true differential abundance over time, i.e.
f (
t
) ≠ 0 ∀
t
⇔
≠
. Likewise, specificity is defined as the number of repetitions where no differential abundance was detected across
all timepoints over the total number of repetitions given that the function form had no true differential abundance over time, i.e.,
f (
t
) = 0 ∀
t
. The other remaining metrics are continuous values that look to compare how closely the estimated mean trend is to the true trend at a set of points
t
∊ {
t
1, . . . ,
t
}. Cosine similarity is comparable across different lengths of
t, but is not particularly discriminant especially near the boundaries around –1 and 1. The Euclidean distance quantifies how far apart each point is but the length of
t is highly influential. Therefore, to make the Euclidean distance comparable across different lengths of repeated observations we can use the normalized Euclidean distance which first transforms the estimated and true functional form into unit vectors and then calculates the distance between these unit vectors.
Sensitivity and specificity results
Using these performance metrics we simulated data across a range of different parameters settings and then estimated the functional form of the trend using the metaSplines procedure described earlier for a total of 100 repetitions for each parameter setting. Below we show the performance results for a simulation where the functional form was fixed as L_up with an AR(1) correlation structure,
ρ = 0.7, and varied the sample size per group, standard deviation, and timepoints from small, medium, and large respectively. The corresponding sensitivity and specificity results are shown in
Figure 8a and
Figure 8b.
Figure 8.
Sensitivity and specificity results for L_up Hockey Stick type trend for an AR(1) correlation structure with parameters:
β = 1, IP = (
t
+1)/2,
ρ = 0.7.
Remaining parameters were varied to create 27 different combinations of repeated measurements, sample size per group, and
σ. Points plot are the average result of
B = 100 repetitions.
Sensitivity and specificity results for L_up Hockey Stick type trend for an AR(1) correlation structure with parameters:
β = 1, IP = (
t
+1)/2,
ρ = 0.7.
Remaining parameters were varied to create 27 different combinations of repeated measurements, sample size per group, and
σ. Points plot are the average result of
B = 100 repetitions.Looking at
Figure 8a, in general the sensitivity decreases as
σ increases for a fixed sample size and
q. For example when
n
0 =
n
1 = 10 and
q = 6 the estimation procedure is perfectly sensitive (100%) when
σ = 1 but has lower sensitivity (42%) when
σ = 4. Also as the sample sizes increases for a fixed
q and
σ, sensitivity generally increases. Likewise, as the number of repeated observations increase, i.e.
q increases, the sensitivity increases quite dramatically. This figure suggests that 6 repeated measurements is sufficiently large to detect differential abundance for strong (
σ = 1) or medium (
σ = 2) signals regardless of the sample size per group. On the other hand, we can look at the specificity in
Figure 8b to see that these trends are no longer monotonic. In general we note that as
q increases the specificity decreases and that as
σ increases the specificity tends to increase. However, the trend for sample size is more nuanced and may variable due to the number of repetitions that were estimable. Using the metaSplines method there were cases with small sample size and repeated observations that the method returned no estimate.The sensitivity results shown above were for a single choice of functional form, but this is another potential parameter of interest to test. We ran a similar set of parameter combinations for 7 other functional forms shown in
Table 1 below to compare the sensitivity as a function of the type of trend. In this table we can see that the non-differential trends, Oscillating, and variable trends, Hockey Stick, had lower average sensitivity while the linear and quadratic trends tended to perform the best.
Table 1.
Estimated sensitivity from metaSplines method for data simulated from each respective functional form for a total of 100 repetitions across 27 different parameter settings fixing the correlation structure to be AR(1) with
ρ = 0.7.
Parameter values used:
σ ∊ {1, 2, 4},
n
0 =
n
1 ∊ {10, 20, 50},
q ∊ {3, 6, 12}. Note that the Total Non-Missing Observations is less than the Total Observations.
Functional Form
Sensitivity
Total Repetitions
Non-Missing Estimates
Linear Increasing
1.00
2700
2686
Linear Decreasing
0.97
2700
2634
Quadratic: Concave Up
0.91
2700
2154
Quadratic: Concave Down
0.95
2700
2600
Oscillating 1
0.96
2700
2614
Oscillating 2
0.84
2700
2501
Hockey Stick 1
0.78
2700
2261
Hockey Stick 2
0.77
2700
2280
Estimated sensitivity from metaSplines method for data simulated from each respective functional form for a total of 100 repetitions across 27 different parameter settings fixing the correlation structure to be AR(1) with
ρ = 0.7.
Parameter values used:
σ ∊ {1, 2, 4},
n
0 =
n
1 ∊ {10, 20, 50},
q ∊ {3, 6, 12}. Note that the Total Non-Missing Observations is less than the Total Observations.
Continuous performance results
The continuous performance metrics for the cosine similarity, Euclidean distance and normalized Euclidean distance are shown in
Figure 9 for the L_up trend with AR(1),
ρ = 0.7. From this figure we see similar trends as the sensitivity results. Starting from the left most panel we see that the cosine similarity is highest when
σ is small,
q,
n
0,
n
1 are large. The spread of cosine similarity scores when
q = 12 are very tightly clustered around 1 while the spread of values when
q = 3 or
q = 6 is larger. The center plot illustrates that using raw Euclidean distances with a small number of repeated measurements tend to have smaller distances, but this trend is not seen with normalized Euclidean distance in the last panel. Within each value of
q in this middle panel there is a consistent trend that as the sample size per group increases the distance generally decreases. Finally moving to the last panel we have the normalized Euclidean distance, which can now be used to compare across different repeated measurement panels. We see a similar trend to the cosine similarity where the distance decreases, meaning better performance, for small
σ and large
q and
n
0 =
n
1.
Figure 9.
Estimated values of performance metrics including cosine similarity, Euclidean distance, and normalized Euclidean distance based on 100 repetitions for an L_up Hockey Stick trend with AR(1) correlation structure,
ρ = 0.7, simulated across multiple settings varying repeated measurements
q, sample size per group,
n
0 and
n
1 and
σ.
Note that the red dashed line serves as a reference point at 0.5 and the green dot in each panel represents the mean value across the 100 repetitions
Estimated values of performance metrics including cosine similarity, Euclidean distance, and normalized Euclidean distance based on 100 repetitions for an L_up Hockey Stick trend with AR(1) correlation structure,
ρ = 0.7, simulated across multiple settings varying repeated measurements
q, sample size per group,
n
0 and
n
1 and
σ.
Note that the red dashed line serves as a reference point at 0.5 and the green dot in each panel represents the mean value across the 100 repetitionsSimilar to the sensitivity performance metrics shown in
Table 1, we can also compare the average value of the continuous performance metrics based on functional form. This is shown in
Table 2. Similar trends appear in this table with the linear trends having the highest average cosine similarity scores and lowest average normalized Euclidean distance and non-differentiable trends peforming worse.
Table 2.
Average continuous performance metrics from metaSplines method for data simulated from each respective functional form for a total of 100 repetitions across 27 different parameter settings fixing the correlation structure to be AR(1) with
ρ = 0.7.
Parameter values used:
σ ∊ {1, 2, 4},
n
0 =
n
1 ∊ {10, 20, 50},
q ∊ {3, 6, 12}. Note that the Total Non-Missing Observations is less than the Total Observations.
Functional Form
Total
Repetitions
Non-Missing
Estimates
Avg. Cosine
Similarity
Avg. Euc.
Distance
Avg. Norm.
Euc. Distance
Linear Increasing
2700
2686
0.99
1.26
0.07
Linear Decreasing
2700
2634
0.98
1.27
0.09
Quadratic: Concave Up
2700
2154
0.94
1.60
0.23
Quadratic: Concave Down
2700
2600
0.97
1.55
0.15
Oscillating 1
2700
2614
0.97
1.69
0.14
Oscillating 2
2700
2501
0.88
1.71
0.35
Hockey Stick 1
2700
2261
0.84
1.35
0.40
Hockey Stick 2
2700
2280
0.84
1.38
0.38
Average continuous performance metrics from metaSplines method for data simulated from each respective functional form for a total of 100 repetitions across 27 different parameter settings fixing the correlation structure to be AR(1) with
ρ = 0.7.
Parameter values used:
σ ∊ {1, 2, 4},
n
0 =
n
1 ∊ {10, 20, 50},
q ∊ {3, 6, 12}. Note that the Total Non-Missing Observations is less than the Total Observations.
Conclusions
With an increasing emphasis on understanding the dynamics of microbial communities in various settings, longitudinal sampling studies are underway. There remain many statistical challenges when dealing with longitudinal data collected from marker-gene amplicon sequencing. In order to validate and compare methods of estimation for longitudinal differential abundance a unified simulation framework is needed. Currently available simulation tools include R packages
seqtime
[19] and
untb
[20]. These packages focus primarily on simulation from the perspective of ecological processes aimed to capture the entire community dynamics. With
microboimeDASim package
[9] we instead provide the tools to simulate various functional forms for longitudinal differential abundance with added flexibility to control important factors such as the number of repeated measurements per subject, the number of subjects per group, within subject correlation, sequencing of time measurements, etc. for a specific feature of interest. We have shown the benefit of these simulation tools by constructing a simulation design based on real microbiome data and showed the utility in methods evaluation using the metaSplines estimation procedure to compare the performance across a wide range of different parameter settings. In this manner the
microbiomeDASim helps meet an important need in the research community to help in study design and compare existing methods as well as validate potentially novel methods.
Data availability
All data underlying the results are available as part of the article and no additional source data are required.
Software availability
microbiomeDASim is available at:
http://bioconductor.org/packages/microbiomeDASim.Source code available from:
https://github.com/williazo/microbiomeDASimArchived source code at time of publication:
https://doi.org/10.5281/zenodo.3458563
[9].License:
MIT.The authors have responded to my review comments appropriately.I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.ContributionsThe authors have developed an R package to simulate longitudinal microbiome time course data, especially where there are difference in trajectories between treatment and control groups. This can be used to address,Experimental design: Simulations can guide power analysis, to see whether a proposed study will be well-powered, as a function of assumptions on the generating mechanisms.Methods comparisons: The effectiveness of different methods will depend on the structure of the data, and simulations provide ground truth from which to make assessments.They simulate data one species at a time. Both treatment and control groups are assumed to have gaussian data, truncated below at 0 to reflect transformed counts. Control data are assumed to be drawn from some common mean, but with specified correlation structure over time. Treatment data are assumed to have a mean that deviates from the control according to some function f(), but have the same correlation structure. The authors provide an interface for simulating a few patterns of f() that are believed to be common in real data (e.g., oscillating, quadratic, and linear shapes).The authors share code to display simulated data. They also describe a study evaluating the power of a particular method, 'metaSplines', as simulation parameters are changed.EvaluationStrengths:I like the idea of formalizing simulation-based power analysis. In the microbiome setting, simulations make more sense than theory, but have two issues (1) they are potentially labor-intensive and (2) they can be ad hoc, and never published. By preparing a package, the authors lower the barrier to entry to / introduce a more formal standard for this work, hopefully enabling simulation-based power analysis in the field.The paper is generally technically sound, and reads well. Code is available publicly, is clearly documented, and written in a professional style.Weaknesses:The simulated data are never properly evaluated -- this is my reason for the "partly" response in my report. Of course, any simulation is only an approximation of reality, but it would be nice to know along which dimensions the approximation is close, and along which it is poor. This would also set the stage for studying whether the conclusions that you're aiming for (study design or methods choices) are substantially affected by / robust to these deviations in real data. Something in the spirit of graphical inference could be quite interesting here.
[1]Missed Opportunities:The 'metaSplines' analysis ends somewhat abruptly, because it's not clear what actual conclusions would be drawn from it. I think it would be interesting if you compared another method against it, because you'd be getting at something like the relative efficiency of the approaches (you could also measure their robustness to particular assumptions).The functional forms seem somewhat restrictive, though I see their value for people who don't want to spend time writing code. Could you define some kind of interface that makes it easier for people to specify classes of alternatives? E.g., maybe you could let people draw functions interactively, or use as input some examples of microbiome series they see in real data.DiscussionI have trouble believing in any kind of i.i.d. assumption across species. First, the scale of abundance across species tends to differ by orders of magnitude. Second, many species exhibit very similar behavior.Among the controls, couldn't some species also vary over time, because of factors in that individual that change which are not specifically treatment?Setting missing data to 0 is generally bad practice, because then you can't distinguish true zeros from missingness. You should either do proper missing data imputation, or recommend methods that explicitly model the missign values / don't require measurements at equal timepoints.The different correlation structures you propose reflect an equispaced sampling design. It wouldn't be too hard to change the correlation structure to allow for unevenly spaced sampling, and it would address your point (4, "Asynchronous repeated measures").Could you create an interactive notebook? E.g., using binder:
https://mybinder.org/v2/gh/krisrs1128/microbiome_dasim_example/master. This would make it easier for people (esp. nonexperts) to get acquainted with your work, without having to install jupyter etc.For dosage effects, I'd find a (reversed) sawtooth or wavelet-style spike more believable than an oscillating function. But again, this is related to the point of letting people choose their own alternatives.Minor CommentsThe caption in Figure 5 seems deprecated.I don't think you ever defined "OTU".The library load should say "microbiome" not "microbime".There are still a few typos here and there (e.g., "differential abundant" features and "metrics of success results"), so I recommend another careful read.I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.Thank you for your careful review of the manuscript and suggestions. Responses to issues raised are shown below for specific points raised.WeaknessIn the vein of evaluating the robustness of the simulation in approximating reality we have included an additional section “Approximating Observed Microbiome Data” that aims to show how the current package could complement real-world microbiome data. Some of the implications and thought processes for using the simulation package in this setting are discussed within the details of this section.Missed OpportunitiesWe thank the reviewer for this comment. The metaSplines analysis that is included in the manuscript is meant to serve as an illustration of how the simulator could be used to evaluate longitudinal differential abundance methods. In the interest of focusing this software tools manuscript on the simulator package itself, a full comparison of different methods was not investigated. However, this would be a valuable avenue to explore in more depth in a subsequent write-up.Presently we are not aware of any interface within R that would dynamically allow users to draw functions. This would be highly useful and we would like to continue adding in different functional forms within the package. The currently available forms were an initial foray into some potentially relevant types of trends that might be observed. Users with R expertise can modify the mean_trend function to create alternative functional forms, but allowing full user specification may create an unintended burden for many practitioners. In the future, we will consider some alternative options that allow for higher flexibility while maintaining usability.DiscussionIn our simulation design we are restricting to a single feature of interest when generating data and therefore are inherently ignoring variability across species. This feature simulation can be tailored for individual species of interest and would be run separately in each case.The control group could also vary over time, but from a simulation perspective we are treating the design as if the sample has been norm referenced across time for the control group. Since the main goal of estimation is calculating the difference between the treatment and control group over time, restricting the control group to be invariant over time simplifies the user input and maintains the primary goal of estimation.By default when inducing missingness in the data, the values are treated as NA rather than 0. However, we included the option to specify the value of the missing data to represent cases where there may be some true non-zero occurrence but due to technical limitations such as read depth the values do not appear. The process of generating missingness is meant to align with some of the typical issues such as loss to follow-up when conducting these types of longitudinal designs.Thank you for this comment - as a result we have decided to expand the functionality to allow for asynchronous sampling over a specified interval (using asynch_time=TRUE) or alternatively to have the user specify discrete sampling times for each individual with the mvrnorm_sim_obs function. An example of using each of these asynchronous sampling schemes have been included in the updated manuscript. The compound and independent correlation structures remain unchanged in this unevenly spaced sampling design, but the AR(1) correlation structure now incorporates the amount of time between each sample as |t_{i}-t_{j}|.Thank you for this suggestion. The original instructions for installing and running Jupyter with an R kernel were indeed cumbersome. To make the notebook easily interactive, we have re-compiled the materials using Google Colab with a simple badge on top that will allow users to run the code without requiring local installation and setup of Jupyter.Thank you for pointing out these possible functional forms. We will work to expand the functional forms available to include these types of trends in the future. As mentioned earlier the ability to define the mean trend has a natural tradeoff between flexibility and useability.Minor CommentsCaption texts, grammatical errors, and typos pointed out have been corrected. Additional read throughs have also been performed to minimize these types of mistakes in the latest draft.This manuscript introduces a new method for simulating longitudinal differential abundance for microbiome data. The method is implemented as an R/Bioc package. The proposed package allows the user to simulate longitudinal microbiome data based on various assumptions, and allows the tuning of key design aspects such as signal-to-noise ratio, correlation structure, effect size and zero inflation. One of the available methods is validated with benchmarking comparisons.The manuscript is technically sound and written in a fluent and easily understandable English. Experiments and statistical analyses have been conducted rigorously. The source code and experiments are openly available via Github but I have not tried to replicate the analysis.Realistic simulations are valuable for study design, and help to address questions about sample size, density of time points, experimental costs, etc. The work provides pragmatic solutions to a topical problem in microbiome bioinformatics.Major comments:
Minor comments:The simulator provides versatile options to tune signal shape, correlations, and noise. However, I am left wondering how well the simulations correspond to real microbiome data. In particular, it is not clear nor validated how the time series shape and correlation structures correspond to known processes in microbial ecology, such as neutral process, competition models (such as generalized Lotka-Volterra), compositionally aware naive models (Dirichlet-Multinomial), mean-reversing processes (Ornstein-Uhlenbeck). All of these have ecological interpretations and have been visible in recent microbiome time series literature. These models are motivated by known ecological processes, rather than technical modifications on the signal shape; it would be relevant to know how large impact the chosen modeling assumptions might have on the results. Can we expect that the proposed simulator will yield qualitative similar conclusions, even if the connection to ecological mechanisms might be weak?The proposed model does not (explicitly) account for heteroschedasticity or overdispersion, and its performance has not been demonstrated with recently popular models of differential abundance, such as DESeq2. It could be true that longitudinal testing of differential abundance requires different methodology. But longitudinal simulators can be also used to simulate cross-sectional data, which is always a snap-shot of longitudinal data. I wonder if the simulator would perform well with standard methods for cross-sectional data; or if it can be shown to yield similar overall distributions. This could provide some additional support for the simulations as the feasibility of the modeling assumptions and their impact on the conclusions remains open.Other simulators for microbiome data and time series are available. One that I am aware of is the seqtime package (
https://github.com/hallucigenia-sparsa/seqtime), although that is only available as an R package (and not formally published), but there may be other recent simulators. I did not find other simulation works being cited, it would be good to check if other simulators can be identified in the recent literature, and how they relate to this work.Lack of integration with phyloseq is a weakness, as this class structure is now very popular among the microbiome R users, and many tools build directly on that class structure. It would be useful addition to the package if the simulations could be made available in a phyloseq format.I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.Thank you for your review of the manuscript and suggestions for improvement. Both the manuscript and package have been updated to reflect issues raised above. In the following we address point wise specific comments raised from Version 1 of the manuscript.Major Comments:(1) Thank you for this comment. As an additional step to address the ability of the simulator to reflect real microbiome data we have provided an example of approximating clinical data with longitudinal microbiome data in mice from Turnbaugh et. al, 2009. This section was added to the manuscript under “Approximating Observed Microbiome Data” with further details about how the simulator can be used to complement and expand clinical efforts.In particular, we outline some of the steps to consider when constructing a simulated dataset to approximate a real-world study. Although our simulation design does not explicitly account for ecological processes as mentioned, the focus on the underlying distributional assumption defines the scope of problems which can be addressed.The simulator looks to construct values for a single feature (aggregated at the taxonomic level of interest) and thus does not incorporate correlation between features or compositional constraints. By focusing on only single features of interest we expect that the simulator will yield similar conclusions to those observed in clinical experiments, and thus offers practitioners a useful tool when designing or expanding a longitudinal microbiome study.(2) During the construction of the simulator the variance between both groups is held constant, partly in order to reduce the burden of parameter specification on the user. This choice also reflects a belief that the two groups differ only in their mean trend over time, which is often an appropriate default assumption without particular beliefs about how the heteroskedasticity may differ by group over time. However, it is worthwhile to consider adding a heteroskedastic option to the simulator to incorporate potential differences in noise between groups. While the goal of the simulator focuses on longitudinal designs, it is worthwhile to explore its applicability to cross-sectional data. The simulator function can simulate cross-sectional data by setting num_timepoints=1. Further evaluation of the performance in these cases is merited, but falls outside the scope of this initial software tools manuscript.Minor Comments:(1) We thank the reviewer for pointing to these additional simulator packages. A further investigation of the literature returned multiple packages including seqtime, untb, and WrightFisher with similar goals for simulating longitudinal trends. These packages however focus on simulations from a compositional perspective rather than at a single feature level, and lack some of the documentation and formal publication that accompanies our present package. I have updated the manuscript to include references to these additional packages and note some of the differences in the conclusion.(2) Thank you for this comment. We have added additional conversion functions simulate2MRexperiment and simulate2phyloseq that format simulated data into the respective objects of interest for the metagenomeSeq and phyloseq packages. We have also added details about using these functions within the manuscript.
Authors: Aleksandar D Kostic; Dirk Gevers; Heli Siljander; Tommi Vatanen; Tuulia Hyötyläinen; Anu-Maaria Hämäläinen; Aleksandr Peet; Vallo Tillmann; Päivi Pöhö; Ismo Mattila; Harri Lähdesmäki; Eric A Franzosa; Outi Vaarala; Marcus de Goffau; Hermie Harmsen; Jorma Ilonen; Suvi M Virtanen; Clary B Clish; Matej Orešič; Curtis Huttenhower; Mikael Knip; Ramnik J Xavier Journal: Cell Host Microbe Date: 2015-02-05 Impact factor: 21.023
Authors: Peter J Turnbaugh; Vanessa K Ridaura; Jeremiah J Faith; Federico E Rey; Rob Knight; Jeffrey I Gordon Journal: Sci Transl Med Date: 2009-11-11 Impact factor: 17.956
Authors: Tanya Yatsunenko; Federico E Rey; Mark J Manary; Indi Trehan; Maria Gloria Dominguez-Bello; Monica Contreras; Magda Magris; Glida Hidalgo; Robert N Baldassano; Andrey P Anokhin; Andrew C Heath; Barbara Warner; Jens Reeder; Justin Kuczynski; J Gregory Caporaso; Catherine A Lozupone; Christian Lauber; Jose Carlos Clemente; Dan Knights; Rob Knight; Jeffrey I Gordon Journal: Nature Date: 2012-05-09 Impact factor: 49.962
Authors: Alison Morris; Joseph N Paulson; Hisham Talukder; Laura Tipton; Heather Kling; Lijia Cui; Adam Fitch; Mihai Pop; Karen A Norris; Elodie Ghedin Journal: Microbiome Date: 2016-07-08 Impact factor: 14.650