Literature DB >> 23760464

Random sampling process leads to overestimation of β-diversity of microbial communities.

Jizhong Zhou¹, Yi-Huei Jiang, Ye Deng, Zhou Shi, Benjamin Yamin Zhou, Kai Xue, Liyou Wu, Zhili He, Yunfeng Yang.

Abstract

The site-to-site variability in species composition, known as β-diversity, is crucial to understanding spatiotemporal patterns of species diversity and the mechanisms controlling community composition and structure. However, quantifying β-diversity in microbial ecology using sequencing-based technologies is a great challenge because of a high number of sequencing errors, bias, and poor reproducibility and quantification. Herein, based on general sampling theory, a mathematical framework is first developed for simulating the effects of random sampling processes on quantifying β-diversity when the community size is known or unknown. Also, using an analogous ball example under Poisson sampling with limited sampling efforts, the developed mathematical framework can exactly predict the low reproducibility among technically replicate samples from the same community of a certain species abundance distribution, which provides explicit evidences of random sampling processes as the main factor causing high percentages of technical variations. In addition, the predicted values under Poisson random sampling were highly consistent with the observed low percentages of operational taxonomic unit (OTU) overlap (<30% and <20% for two and three tags, respectively, based on both Jaccard and Bray-Curtis dissimilarity indexes), further supporting the hypothesis that the poor reproducibility among technical replicates is due to the artifacts associated with random sampling processes. Finally, a mathematical framework was developed for predicting sampling efforts to achieve a desired overlap among replicate samples. Our modeling simulations predict that several orders of magnitude more sequencing efforts are needed to achieve desired high technical reproducibility. These results suggest that great caution needs to be taken in quantifying and interpreting β-diversity for microbial community analysis using next-generation sequencing technologies. IMPORTANCE Due to the vast diversity and uncultivated status of the majority of microorganisms, microbial detection, characterization, and quantitation are of great challenge. Although large-scale metagenome sequencing technology such as PCR-based amplicon sequencing has revolutionized the studies of microbial communities, it suffers from several inherent drawbacks, such as a high number of sequencing errors, biases, poor quantitation, and very high percentages of technical variations, which could greatly overestimate microbial biodiversity. Based on general sampling theory, this study provided the first explicit evidence to demonstrate the importance of random sampling processes in estimating microbial β-diversity, which has not been adequately recognized and addressed in microbial ecology. Since most ecological studies are involved in random sampling, the conclusions learned from this study should also be applicable to other ecological studies in general. In summary, the results presented in this study should have important implications for examining microbial biodiversity to address both basic theoretical and applied management questions.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2013 PMID： 23760464 PMCID： PMC3684833 DOI： 10.1128/mBio.00324-13

Source DB: PubMed Journal: MBio Impact factor: 7.867

Introduction

Microorganisms appear to be the most diverse group of life presently known, inhabiting almost every imaginable environment on Earth (1). They play integral and unique roles in ecosystem functions and biogeochemical cycling of carbon (C), nitrogen (N), sulfur (S), phosphorus (P), and various metals. Understanding the structure, functions, interactions, stability, and adaptations of microbial populations/communities is crucial for basic science discovery (2, 3), biotechnology (4), agriculture (5), energy (6), the environment (7), and human health (8). However, due to their extremely high diversity and as-yet-uncultivated status, characterizing microbial diversity and establishing the linkages between microbial diversity and ecosystem function are very challenging (9). Understanding the mechanisms controlling microbial diversity and functions is even more difficult. The recent advances in metagenomics, which has emerged as a cutting-edge 21st century science (10), and associated metagenomics technologies such as high-throughput sequencing and microarrays (10, 11) provide revolutionary tools to address these challenges. Large-scale high-throughput sequencing-based metagenomics is providing unprecedented views of the taxonomic diversity, metabolic potential, and ecological roles of microbial communities in various habitats (12–32). Various studies clearly demonstrate that large-scale sequencing approaches are powerful in studying microbial community diversity and activity (23, 33–43). β-Diversity, the site-to-site variability in species composition, is crucial in understanding patterns of species diversity across various spatial and temporal scales. It can provide insight on the mechanisms controlling community compositions and structure (44). β-Diversity is widely used for investigating the mechanisms controlling biodiversity (45–47) and the responses of biological communities to environmental changes (45). In microbial ecology, high-throughput metagenomics sequencing and associated technologies are the major tools for examining the site-to-site variability in species composition and its response to environmental changes (48, 49). However, the amplicon-based approach has shown limited reproducibility, especially when examining low-abundance taxa (50). For instance, based on the overlap of operational taxonomic units (OTUs), a very low reproducibility (<20% between two technical replicates and <10% among three technical replicates) was obtained (50), which is far away from the theoretical expectation of 100% overlap among technical replicates. Similar results were recently obtained with mock communities (51) and human microbiomes (13). The high numbers of variations in technical replicates are most likely due to the sampling artifacts associated with random sampling processes (50, 52), because many steps in the pyrotag-based sequencing analysis are associated with random sampling, e.g., PCR amplification of target genes, ligation of amplified PCR products to sequencing adaptors, emulsion and immobilization of beads, and bead deposition (50). Since microbial communities under natural settings are extremely complex and generally have abundance curves with very long tails, that is, large portions of OTUs exist in extremely low abundance, the probability of sampling such rare OTUs in a sampling event is low. The chances of resampling them are even lower, especially with limited sampling efforts (i.e., percentages of total individuals sampled in a community). It is expected that the severity of such sampling artifacts on community comparison is dependent on community complexity and sampling efforts. As the complexity of a microbial community increases, such an artifact will become more severe. Increasing sampling efforts will help to ameliorate such a problem (50). However, there is no theoretical foundation to support such a speculation. In this study, we hypothesize that a random sampling process is the main cause for high numbers of variations among technical replicates. To test this hypothesis, the main objective of this study is to provide a theoretical foundation on understanding such sampling artifacts associated with a random sampling process. We first developed a theoretical framework to simulate the random sampling processes based on general sampling theory and to predict sampling efforts for achieving a desired reproducibility. We then illustrated the sampling artifacts associated with random sampling processes using an analogous example. We also examined whether the developed framework could be used to predict the low percentage of overlap of OTUs among technical replicates. Our results indicated that high numbers of variations in technical replicates were due to the artifacts associated with random sampling processes.

Mathematical framework. (i) Sampling individuals from a large regional community.

The pattern of species abundances across different spatial-temporal scales is a central issue in ecology. However, it is generally impossible to directly measure the abundance of all species at ecologically relevant scales. Thus, it is important to understand the relationship between the underlying species abundance distribution of a large regional community and the observed abundance distribution in a small sample from the large regional community. Various statistical sampling theories are developed to describe the relationships of the species abundance between the sampled community and the large regional community (53). Here, we will use general sampling theory to simulate and predict the sampling artifacts associated with random sampling processes. Assuming that a species (or OTU) occurs in a large regional community, the number of its individuals that exist in a small sample of the large community is dependent on the total abundance of that species in the large community, the size of the sample, and the spatial distribution of the individuals (54). In this study, we assume that all individuals in the large regional community are located randomly in space. Let N represent the total number of individuals (e.g., 16S rRNA gene sequences) and n be the number of different species (e.g., OTUs), each with abundances x, x, …, x. If an individual is randomly sampled from the community, the probability of it belonging to the ith species is x/N. If m individuals are randomly sampled from this large community, the expected number of individuals from the ith species is mx/N. With replacement, then, the probability of the ith species with abundance x to be encountered by k individuals in the sample is given by the following binomial function expression (55, 56): To estimate the probability that at least one individual of the ith species is present in the sample, we generally calculate the probability that the ith species is absent in the sample, i.e., k = 0. If k = 0, then the probability of the ith species with abundance x in a community to be absent in the sample is expressed as Let x/N equal ax/m, where a = m/N, the sampling ratio. Then, equation 2 is approximated to the exponential form of the Poisson distribution: According to the Poisson distribution, the probability of the ith species with abundance x in the community to have at least one individual in the sample can be expressed as follows (54): The Poisson distribution is the simplest model for sampling individuals from a large regional community. Based on general sampling theory (53, 56), the abundance distribution observed in a sample that constitutes a proportion a of the large regional community can be expressed as where φ(m) is the observed species abundance distribution in a sample with m individuals sampled. φ(x) represents the species abundance distribution with abundance x in the large regional community, in which θ is the vector of parameters (57).

(ii) Expected species overlap among samples with the size of the large community known.

Given a large community with a known underlying species abundance distribution, φ(x), the expected proportion of species in common between samples can be theoretically predicted. In the case in which the species abundances in these samples are perfectly correlated, the expected number of the species shared between two samples is given by (54) and the number of species shared among three samples is where a1, a2, and a3 are sample ratios from samples 1, 2 and 3, and a1 = m1/N,l = 1, 2, 3. m1 is the number of individuals (i.e., 16S rRNA gene sequences in this study) obtained from the lth sample. Various similarity metrics are used for assessing overlap among different samples (58). In this study, the two popular similarity metrics, Jaccard’s incidence-based and Bray-Curtis’s abundance-based methods, are used. Based on the Jaccard similarity index, the proportion of species (i.e., OTUs in this study) overlap between two samples is calculated as follows: The proportion of species overlap among three samples is where There are seven commonly used continuous species abundance distributions, including lognormal, exponential, gamma, truncated hyperbolic, continuous log series (54), inverse gamma, and inverse Gaussian distribution. When the total number of individuals of the regional community (N) is known, the Jaccard similarity-based explicit expression functions of the proportion of species overlap between two or three samples are summarized in Tables S1A and S1B in the supplemental material.

(iii) Expected species overlap among samples with the size of the large community unknown.

Using equations 8 and 9 to estimate the percentages of species overlap between two or three samples requires information about the total individuals (N) of the communities examined. However, in most cases, the number of total individuals in a community is unknown. In the following, we will consider the situation in which N is unknown. Since most of the species abundance distributions are scale invariant under a Poisson sampling process (53, 57, 59, 60), the expected sample abundance distribution can be obtained by rescaling the community abundance distribution (x) to the sample abundance distribution (y): y = px, where p is the proportion of the community sampled. Thus, equation 8 can be rewritten as where, and θ* are the sample ratios and the vector of parameters when N is unknown, which are the rescaled parameters and functions of p. In the case of two random samples from the same community with parameters θ, we set p = a1 + a2 and then obtain = a1/(a1 + a2) and = a2/(a1 + a2). Let m1 and m2 be the total number of individuals observed in sample 1 and sample 2, respectively. Now we can substitute a = m1/N and a = m2/N into and, and then both and are the functions of m1 and m2, i.e., = m1/(m1 + m2) and = m2/(m1 + m2). θ* is the vector of parameters of the species abundance distribution in the samples. Consequently, the expected species overlap can be obtained by fitting data to the parameters θ* without knowing the total number of individuals of the community, N. Similarly, in the case of three samples randomly sampled from one community, the parameter p can be set to the sum of sampling ratios, p = a1 + a2 + a3. Then, equation 9 can be rewritten as where and = a1/(a1 + a2 + a3) = m1/(m1 + m2 + m3), = a2/(a1 + a2 + a3) = m2/(m1 + m2 + m3), = a3/(a1 + a2+ + a3) = m3/(m1 + m2 + m3). Now, according to equations 10 and 11, the species overlap for two and three samples can be estimated based on sample abundance y rather than community abundance x, so that there is no need to know the community size N. When the total number of individuals of the regional community is unknown, the Jaccard similarity-based explicit expression functions of the proportion of species overlap between two or among three samples are summarized in Tables S1C and S1D in the supplemental material.

(iv) Predicting sampling efforts for achieving a desired overlap among replicate samples.

One important question in practice is what levels of sampling efforts are needed for achieving desired species overlap when the number of total individuals of the community is unknown. In the following, we will address this practical question by assuming that and and individuals are needed for sampling to achieve a desired overlap between two samples. To simplify the situation, let = = , and sampling ratio A = /N. Based on equation 13, when N is known, the predicted overlap between two samples is given by Similarly, when N is unknown, we can take the transformation of the abundance variable y = px = (a1 + a2)x. By comparing equations 8 and 10, the relationship of the expected number of the shared species when N is known or unknown is given by where is the predicted sample ratio for achieving the desired overlap between two samples, and θ* is the vector of scaled parameters. Rearranging A* as the expression of the number of sequences, A* = m′/(m1 + m2), the predicted overlap model is given by Similarly, the predicted overlap model for three samples is given by By solving equations 15 and 16, the sampling efforts m′ can be estimated based on sample abundance y without knowing the community size N. The Jaccard similarity-based explicit expression functions of the proportion of species overlap between two or among three samples for predicting sampling efforts are summarized in Tables S1C and S1D in the supplemental material.

RESULTS

Simulation with an analogous example.

To better illustrate the effects of random sampling processes on the OTU overlap among technical replicates (3), we use an analogous example by randomly sampling balls from three jars containing the same number of balls of different colors (Fig. 1). Assume that three identical jars contain N balls of n different colors. Their abundance distributions vary among different colors of balls but are identical among these jars. Here, individual balls are equivalent to individual 16S rRNA gene sequences, while the types of balls with different colors are equivalent to individual OTUs. To simplify the situation, we assume that the same numbers of balls, m, are randomly sampled from these three jars, yielding samples 1, 2, and 3 (Fig. 1). Theoretically, if all balls from the whole jars are sampled (i.e., m = N), 100% ball overlap will be expected among these samples. However, in reality, the percentages of ball overlap will be less than 100%, because they depend on the sampling efforts, ball abundance distribution, and complexity of the community. The differences of the percentages of ball overlap between the theoretically predicted and the observed values among different sampling events are entirely due to random sampling processes, because there is no difference in the ball compositions and abundances among these three jars.

FIG 1

An analogous example to simulate random sampling processes. Three identical jars contain the same number and types of balls, with identical ball abundance distribution.

An analogous example to simulate random sampling processes. Three identical jars contain the same number and types of balls, with identical ball abundance distribution. With the assumption that the ball abundance distributions in these three jars follow any of the five continuous species abundance distributions as listed in Table S1A in the supplemental material, we simulated the effects of the random sampling processes on the ball overlap based on Jaccard and Bray-Curtis indexes for five different distributions: exponential, gamma, lognormal, inverse Gaussian, and inverse gamma. Under each specific ball abundance distribution, the average observed overlap through simulations of 100 repeated samplings (see Materials and Methods for details) was calculated and compared to the theoretically predicted overlap between two samples based on equation 8 or 10, when N is known or unknown, respectively. Similar analyses were carried out for comparing ball overlaps among three samples based on equation 9 or 11. Although the simulation results vary considerably with the parameters selected, the following generalizations can be drawn. First, no significant differences of Jaccard similarities were observed between the theoretically predicted and the observed overlap values for all five species abundance distributions (Table 1 and Fig. 2; see also Fig. S1 and S2 in the supplemental material), indicating that the theoretically predicted percentages of ball overlap match well to the observed percentages of ball overlap for all ball abundance distributions examined. Second, there were no significant differences of the predicted overlap percentages derived from equation 8 with known N and equation 10 with unknown N for two samples or from equation 9 with known N and equation 11 with unknown N for three samples (see Table S2 in the supplemental material), suggesting that accurate predictions of the overlap percentages between two or among three samples can be obtained when the size of the regional communities is unknown. In addition, low ball overlap percentages were observed with low sampling efforts under different ball abundance distributions (Fig. 2; see also Fig. S1 and S2). For instance, under exponential distribution, when 1% of the community was sampled, 50% overlap between two samples (Fig. 2A) and 34% overlap among three samples (Fig. 2B) were obtained. As the sampling efforts increase, the ball overlap among samples increases under different ball abundance distributions (Fig. 2; see also Fig. S1 and S2). For example, under exponential distribution, when 10% of the community was sampled, 91% ball overlap between two samples (Fig. 2A) and 82% among three samples were obtained (Fig. 2B). All of the above results suggested that accurate predictions of the overlap could be obtained between two samples with equation 8 or 10 and among three samples with equation 9 or 11.

TABLE 1

Chi-square-based goodness-of-fit test of the observed and predicted percentages of overlap for the analogous example[]

OTU abundancedistribution	Two samples				Three samples
	Known N		Unknown N		Known N		Unknown N
	χ²	P	χ²	P	χ²	P	χ²	P
Exponential	6.6 × 10⁻⁵	0.999	6.6 × 10⁻⁵	0.999	2.7 × 10⁻⁴	0.999	2.7 × 10⁻⁴	0.999
Gamma	1.9 × 10⁻⁶	0.999	5.1 × 10⁻⁴	0.999	9.4 × 10⁻⁶	0.999	6.5 × 10⁻³	0.999
Lognormal	4.2 × 10⁻⁴	0.999	2.1 × 10⁻⁴	0.999	1.1 × 10⁻³	0.999	1.1 × 10⁻³	0.999
Inverse gamma	4.0 × 10⁻⁶	0.999	2.1 × 10⁻⁴	0.999	6.5 × 10⁻⁶	0.999	2.2 × 10⁻³	0.999
Inverse Gaussian	3.3 × 10⁻⁵	0.999	9.1 × 10⁻⁴	0.999	7.5 × 10⁻⁵	0.999	8.5 × 10⁻⁴	0.999

Detailed information is presented in Fig. S2 and S3 in the supplemental material.

FIG 2

The relationships between the expected Jaccard overlaps of ball colors and sampling efforts under the exponential abundance distribution, assuming the community has 106 individual balls and 104 types of balls, with different colors. Distribution parameter is set to λ = 1 × 10−2. In each case, we calculated the theoretically predicted overlap (blue line) by equation 8 when N is known, the predicted overlap (red line) by equation 10 when N is unknown, and the average observed overlap (point) through simulations of 100 repeated samplings. (A) Two samples. The sample ratio is a1 = a2. (B) Three samples. The sample ratio is a1 = a2 = a3.

Chi-square-based goodness-of-fit test of the observed and predicted percentages of overlap for the analogous example[] Detailed information is presented in Fig. S2 and S3 in the supplemental material. The relationships between the expected Jaccard overlaps of ball colors and sampling efforts under the exponential abundance distribution, assuming the community has 106 individual balls and 104 types of balls, with different colors. Distribution parameter is set to λ = 1 × 10−2. In each case, we calculated the theoretically predicted overlap (blue line) by equation 8 when N is known, the predicted overlap (red line) by equation 10 when N is unknown, and the average observed overlap (point) through simulations of 100 repeated samplings. (A) Two samples. The sample ratio is a1 = a2. (B) Three samples. The sample ratio is a1 = a2 = a3. Because the general explicit form of the ball overlap for quantitative similarity index could not be derived, numerical simulations were performed (see Materials and Methods) for all five ball abundance distributions (see Fig. S3 in the supplemental material) for comparisons based on Jaccard similarity, all of the numerical simulations indicated that low percentages of overlap were also observed among samples with low sampling efforts. For instance, under exponential distribution, when 1% of the community was sampled, 53% of ball overlap based on the Bray-Curtis index was obtained between two samples and 35% among three samples. All of these results, based on both Jaccard and Bray-Curtis similarities, indicate that substantially high numbers of variations can be obtained among communities of identical compositions and abundance distributions when sampling effort is low. Such variation is entirely due to random sampling processes.

Empirical examples.

In our previous studies, the composition and structure of 24 microbial communities from a long-term global change experimental site in Oklahoma were analyzed using the amplicon-based sequencing detection approach (49). Each community was amplified with 2 or 3 bar-coded primers, followed by sequencing using both forward and reverse primers. Thus, three types of datasets were available: sequences from forward and reverse primers and combined sequences. Since the sequences of individual technical replicates within an experimental plot are derived from the same community, conceptually, they should obey the same species abundance distribution. Thus, the sequences from all technical replicates within a plot (a community) were pooled for fitting species abundance models described in Table S1A in the supplemental material. The best-fit model and associated parameters for each plot (i.e., community) were shown in Table S3A for the soil communities amplified with two tags and in Table S3B for the soil communities amplified with three tags. Either the exponential abundance distribution or the inverse gamma distribution is the preferred OTU abundance distribution for all soil samples (see Tables S3A and S3B). Once the OTU abundance distribution models are determined for all communities examined, the predicted overlap percentages in terms of Jaccard and Bray-Curtis similarities were calculated based on the formula provided in Tables S1C and S1D in the supplemental material. The results were listed in Tables S4A and S4B for the communities amplified with two tags and in Tables S5A and S5B for those with three tags. Overall, no significant differences were observed between observed and predicted Jaccard overlaps (Table 2).

TABLE 2

Chi-square-based goodness-of-fit test of the observed and predicted percentages of overlap for the experimental data[]

Communitiesamplified with:	Similaritytest	Forward primer		Reverse primer		Combined
Communitiesamplified with:	Similaritytest	χ²	P	χ²	P	χ²	P
Two tags	Jaccard	0.040	0.999	0.060	0.999	0.038	0.999
Two tags	Bray-Curtis	0.106	0.999	0.122	0.999	0.075	0.999
Three tags	Jaccard	0.006	0.999	0.010	0.999	0.005	0.999
Three tags	Bray-Curtis	0.008	0.999	0.026	0.999	0.016	0.999

Detailed data are listed in Table S4A (two tags, Jaccard), S4B (two tags, Bray-Curtis), S5A (three tags, Jaccard), and S5B (three tags, Bray-Curtis) in the supplemental material.

Chi-square-based goodness-of-fit test of the observed and predicted percentages of overlap for the experimental data[] Detailed data are listed in Table S4A (two tags, Jaccard), S4B (two tags, Bray-Curtis), S5A (three tags, Jaccard), and S5B (three tags, Bray-Curtis) in the supplemental material. Since Bray-Curtis overlap is a quantitative similarity index, it is generally higher than the incidence-based overlap (see Tables S5A and S5B in the supplemental material). Overall, no significant differences were observed between observed and predicted Bray-Curtis overlap percentages (Table 2). All of the above-reported results indicated that the predicted overlap percentages match well to the observed results. Therefore, the variations among technical replicates can be best explained by the artifacts associated with random sampling processes.

Predictions of sampling efforts for the desired species overlap.

Based on equations 15 and 16, on the explicit expression functions presented in Tables S1C and S1D, and on sample abundance and sequence information for various tags in Tables S3A and S3B, sampling efforts for achieving various degrees of OTU overlaps between two (see Table S6A) and among three technical replicates (see Table S6B) were predicted. To achieve 20% OTU overlap for two tags, an average of 2,367 sequences are needed (see Table S6A; Fig. 2A, 2UC), which is consistent with the number of sequences obtained in this experiment. To achieve 90% overlap between two technical replicates, an average of 60,900 sequences are needed (see Table S6A). For instance, for the community of 2UC, a total of 71,400 of the combined sequences would be needed to achieve 90% OTU overlap between two technical replicates (Fig. 3A), which is about 32 times more sequence reads needed than we sequenced previously (see Table S4 to S6).

FIG 3

Prediction of sampling efforts for desired OTU overlap. (A) Desired overlap between two tags based on the combined sequences from sample 2UC. The sampling efforts were calculated based on equation 15. The parameters for species abundance distribution were from Table S3A in the supplemental material. (B) Desired overlap among three tags based on the combined sequences from sample 1UC. The sampling efforts were calculated based on equation 16. The parameters for species abundance distribution were from Table S3B. Much more sequencing effort is needed to achieve desired OTU overlaps among three technical replicates than between two technical replicates. To have 10% OTU overlap for three tags, an average of 3,310 sequences are needed, which is consistent with the number of sequences obtained in this experiment. To reach 90% overlap between three technical replicates, an average of 63,770 sequences are needed (see Table S6B). For example, for the community of 1UC, about 60,500 sequences are required to obtain 90% of OTU overlap (Fig. 3B). The current sampling efforts (2,018 sequences) are far less than the desired 90% OTU overlap. Our results also suggested that most of the work published, especially with soils, is severely undersampling if the goal is to determine significant changes in β-diversity among sampling sites.

DISCUSSION

One of the major technical challenges for the amplicon-based sequencing detection approach is low reproducibility (50), which is a central issue in comparative studies (61). This issue has recently been examined intensively, but it is still a matter of debate (13, 31, 51). Results from several recent studies supported (13, 50, 51, 62), disputed (31, 63–65) or both supported and disputed (66) our previous observations. A number of factors can contribute to such divergent observations, e.g., the complexity of the systems examined (50, 51, 64), differences in sequencing depths (50, 62, 63), and/or variations in sequencing and sequence preprocessing approaches (51, 67). Since most of these factors act with each other, isolating individual factors influencing sequencing reproducibility is extremely difficult, especially when natural communities of unknown diversity background are examined. Using artificial communities of known diversity, Pinto and Raskin (51) provided explicit evidences for the poor reproducibility of the amplicon sequencing-based detection approach, even with the simple community and relatively deep sequencing. Thus, poor reproducibility is a problem inherent in the amplicon sequencing-based detection approach (50). Such poor reproducibility among technical replicates could result from PCR amplification biases (67–72), sequencing errors (51, 67, 71, 72), and/or the artifacts associated with random sampling processes during sample preparation and sequencing (50, 52). In this study, using analogous ball examples, we showed that very low percentages of overlap were observed among replicate samples from the community with identical ball types and numbers when the sampling effort was low, which is very consistent with what we observed experimentally (50). In addition, under different OTU abundance distributions, the obtained overlap percentages among two or three technical replicates were very consistent with the theoretical predicted values under the assumption of random sampling processes. The simulation results presented in this study provided explicit evidences of the contributions of random sampling processes to the high numbers of variations observed among technical replicate samples. It should be noted that spurious OTUs due to sequencing errors could also contribute to such technical variations, but they will be indistinguishable from rare abundant OTUs which can be detected only sporadically in replicate samples (51). The low reproducibility of the amplicon sequencing-based detection approach associated with random sampling processes raises a concern of comparing the β-diversity of microbial communities across different samples for amplicon sequencing. Although the inherent high numbers of variations associated with random sampling processes have less effect on α-diversity, it is problematic in estimating β-diversity (50, 51), which presents a significant challenge for comparative studies across different spatial and temporal scales. However, the degrees of such effects on estimating β-diversity are dependent on the complexity of the community examined and sampling efforts. In general, as the complexity increases, such problems will be more severe (50) and greater sampling efforts will be needed (50, 62). Nevertheless, great caution needs to be taken for estimating community based on OTUs when the sequences examined are not enough to represent community diversity (51, 62). Determining the patterns and distribution of species abundance is a central issue in ecology (53), because they are important in studying both basic ecological theory and applied biodiversity conservation. However, direct measurement of species abundance in ecologically relevant scales is difficult, if not impossible. Instead, the distribution of species abundance in ecology is generally inferred based on limited samples under Poisson sampling. Therefore, the artifacts associated with random sampling processes observed in this study for microbial communities should also be applicable to the other ecological studies in plants and animals, although such problems could be less severe in macroecology. However, to the best of our knowledge, such an issue has not been addressed in the ecological literature, but it is of critical importance in studying species distribution, especially species-area relationships (52, 73). Along with our previous efforts (50, 52), this study clearly demonstrates the importance of random sampling processes in estimating microbial biodiversity, especially β-diversity. The general conclusions learned from this study should be applicable to other ecological studies in general. In conclusion, the factors causing variations in β-diversity are among the most important but poorly understood issues in ecology, because they are the key mechanisms influencing global variation in biodiversity. Next-generation sequencing technologies, such as PCR amplicon sequencing-based detection approaches, have been rapidly used for characterizing microbial biodiversity, but they also suffer from several inherent drawbacks, such as a high number of sequencing errors, biases, and poor reproducibility and quantitation. Through mathematical modeling and simulation based on general sampling theory, this study provides explicit evidences of random sampling processes as the main factor causing high percentages of technical variations and develops a framework for predicting sampling efforts for achieving the desired technical reproducibility. Since most ecological studies are involved in random sampling, the artifacts associated with random sampling processes observed in microbial ecology should also be applicable to macroecology, although such problems could be less severe there. Because such artifacts greatly overestimate β-diversity, great caution should be taken when the amplicon sequencing-based detection is used for drawing quantitative conclusions about β-diversity. Increasing sampling efforts and/or the number of sample replicates (both technical and biological) should be the most effective ways to ameliorate technical reproducibility for drawing more reliable quantitative conclusions, but how to balance the sampling efforts and number of samples analyzed per sequencing run is dependent on biological questions and objectives, as well as the complexity and similarity of communities examined.

MATERIALS AND METHODS

The details for all materials and methods used in this study are provided in the supplemental material (see Text S1). Briefly, based on general sampling theory (54–56), a mathematical framework was first developed under 7 different OTU abundance distributions for simulating the effects of random sampling processes on quantifying β-diversity when the community size is known or unknown. Second, an analogous ball example was used to explicitly illustrate the effects of random sampling processes on the OTU overlap among technical replicates. Third, the theoretical models were fitted with the empirical experimental sequencing data from our previous studies (50), in which a total of 24 soil communities from a long-term climate change experiment facility (74) were sequenced with 60 tags. An average of 1,121 ± 390 OTUs were obtained for each tag based on the combined samples. In addition, a χ2 test was employed to determine whether the predicted OTU overlaps were consistent with the observed values. Supplemental Materials and Methods. Download Text S1, DOCX file, 0.1 MB The relationships between the expected Jaccard overlap of ball colors and sampling efforts for two samples under different abundance distributions. In this hypothetical example, five different community abundance distributions are considered. In each case of abundance distributions, two samples with the same sampling ratio (a1 = a2) were selected from the same community with 106 individuals and 104 types of balls of different colors. In each sampling event, we calculated the theoretically predicted Jaccard overlap (blue line) by equation 8, predicted overlaps (red line) by equation 10, and the average Jaccard observed overlap (dots) of 100 repeated simulations. Distribution parameters are set as follows: gamma distribution, α = 5, β = 20 (A); lognormal distribution, μ = 3.912, σ = 1.2 (B); inverse Gaussian distribution, μ′ = 100, λ′ = 100 (C); inverse gamma distribution, α′ = 6, β′ = 500 (D). Download Figure S1, DOCX file, 0.4 MB The relationships between the expected Jaccard overlap of ball colors and sampling efforts for three samples under different abundance distributions. In each sampling event, we calculate the theoretically predicted Jaccard overlap (blue line) by equation 9, predicted overlaps (red line) by equation 11, and the average Jaccard observed overlap (point) of 100 repeated simulations. (A) Gamma distribution; (B) lognormal distribution; (C) inverse Gaussian distribution; (D) inverse gamma distribution. Download Figure S2, DOCX file, 0.4 MB The relationships between the expected Bray-Curtis overlap of ball colors and sampling efforts for two and three samples under different abundance distributions. Simulations for showing the performance of Bray-Curtis overlap (see Materials and Methods). (A) Exponential distribution for two samples; (B) gamma distribution for two samples; (C) lognormal distribution for two samples; (D) inverse Gaussian distribution for two samples; (E) inverse gamma distribution for two samples; (F) exponential distribution for three samples; (G) gamma distribution for three samples; (H) lognormal distribution for three samples; (I) inverse Gaussian distribution for three samples; (J) inverse gamma distribution for three samples. Download Figure S3, DOCX file, 0.9 MB (A) Summary of the explicit expression functions of Jaccard similarity between two samples under different continuous species abundance distributions when the community size is known; (B) summary of the explicit expression functions of Jaccard similarity among three samples under different continuous species abundance distributions in the case of known community size; (C) summary of the expression functions of Jaccard () based on equation 10 under different continuous species abundance distributions in the case of unknown community size, as well as the expression functions of Jaccard similarity () for predicting sampling effort based on equation 15; (D) summary of the expression functions of Jaccard similarity among three samples () based on equation 11 under different continuous species abundance distributions in the case of community size unknown, as well as the expression functions of Jaccard similarity () for predicting sampling efforts based on equation 16. Table S1, DOCX file, 0.1 MB. Chi-square-based goodness-of-fit test of the predicted percentages of overlap for the analogous example based on equations 8 and 10 or equations 9 and 11. Table S2, DOCX file, 0.1 MB. (A) Best-fitting of OTU abundance distribution and parameter estimation by pooling the sequences together from two tags of each sample; (B) best-fitting of OTU abundance distribution and parameter estimation by pooling the sequences together from three tags of each sample. Table S3, DOCX file, 0.1 MB. (A) The observed and predicted overlaps by equation 10 based on Jaccard similarity between two tags. The OTU abundance distributions and associated parameters used for prediction are from Table S3A in the supplemental material. (B) The observed and predicted overlaps based on Bray-Curtis similarity between two tags. Table S4, DOCX file, 0.1 MB. (A) The observed and predicted overlaps based on Jaccard similarity among three tags. The OTU abundance distributions and associated parameters used for prediction were from Table S3B in the supplemental material. (B) The observed and predicted overlaps based on Bray-Curtis similarity among three tags. Table S5, DOCX file, 0.1 MB. (A) Sampling efforts needed for achieving desired Jaccard overlap between two technical replicates; (B) sampling efforts needed for achieving desired Jaccard overlap among three technical replicates. Table S6, DOCX file, 0.1 MB.

66 in total

1. Acclimatization of soil respiration to warming in a tall grass prairie.

Authors: Y Luo; S Wan; D Hui; L L Wallace
Journal: Nature Date: 2001-10-11 Impact factor: 49.962

2. Theoretical ecology for engineering biology.

Authors: Thomas P Curtis; Ian M Head; David W Graham
Journal: Environ Sci Technol Date: 2003-02-01 Impact factor: 9.028

3. Bugs, stool, and the irritable bowel syndrome: too much is as bad as too little?

Authors: Nicholas J Talley; Anthony A Fodor
Journal: Gastroenterology Date: 2011-09-22 Impact factor: 22.682

4. Genotypic diversity within a natural coastal bacterioplankton population.

Authors: Janelle R Thompson; Sarah Pacocha; Chanathip Pharino; Vanja Klepac-Ceraj; Dana E Hunt; Jennifer Benoit; Ramahi Sarma-Rupavtarm; Daniel L Distel; Martin F Polz
Journal: Science Date: 2005-02-25 Impact factor: 47.728

5. Comparative metagenomics of microbial communities.

Authors: Susannah Green Tringe; Christian von Mering; Arthur Kobayashi; Asaf A Salamov; Kevin Chen; Hwai W Chang; Mircea Podar; Jay M Short; Eric J Mathur; John C Detter; Peer Bork; Philip Hugenholtz; Edward M Rubin
Journal: Science Date: 2005-04-22 Impact factor: 47.728

Review 6. Opinion: Re-evaluating prokaryotic species.

Authors: Dirk Gevers; Frederick M Cohan; Jeffrey G Lawrence; Brian G Spratt; Tom Coenye; Edward J Feil; Erko Stackebrandt; Yves Van de Peer; Peter Vandamme; Fabiano L Thompson; Jean Swings
Journal: Nat Rev Microbiol Date: 2005-09 Impact factor: 60.633

7. A general theory of the sampling process with applications to the "veil line".

Authors: A K Dewdney
Journal: Theor Popul Biol Date: 1998-12 Impact factor: 1.570

8. Barcoded primers used in multiplex amplicon pyrosequencing bias amplification.

Authors: David Berry; Karim Ben Mahfoudh; Michael Wagner; Alexander Loy
Journal: Appl Environ Microbiol Date: 2011-09-02 Impact factor: 4.792

9. Environmental genome shotgun sequencing of the Sargasso Sea.

Authors: J Craig Venter; Karin Remington; John F Heidelberg; Aaron L Halpern; Doug Rusch; Jonathan A Eisen; Dongying Wu; Ian Paulsen; Karen E Nelson; William Nelson; Derrick E Fouts; Samuel Levy; Anthony H Knap; Michael W Lomas; Ken Nealson; Owen White; Jeremy Peterson; Jeff Hoffman; Rachel Parsons; Holly Baden-Tillson; Cynthia Pfannkoch; Yu-Hui Rogers; Hamilton O Smith
Journal: Science Date: 2004-03-04 Impact factor: 47.728

Review 10. Prokaryotes: the unseen majority.

Authors: W B Whitman; D C Coleman; W J Wiebe
Journal: Proc Natl Acad Sci U S A Date: 1998-06-09 Impact factor: 11.205

28 in total

1. A general framework for quantitatively assessing ecological stochasticity.

Authors: Daliang Ning; Ye Deng; James M Tiedje; Jizhong Zhou
Journal: Proc Natl Acad Sci U S A Date: 2019-08-07 Impact factor: 11.205

2. Isolation Chip Increases Culturable Bacterial Diversity and Reduces Cultivation Bias.

Authors: Xinwu Liu; Miaoxiao Wang; Yong Nie; Xiao-Lei Wu
Journal: Curr Microbiol Date: 2021-04-05 Impact factor: 2.188

3. Detecting turnover among complex communities using null models: a case study with sky-island haemosporidian parasites.

Authors: Lisa N Barrow; Selina M Bauernfeind; Paxton A Cruz; Jessie L Williamson; Daniele L Wiley; John E Ford; Matthew J Baumann; Serina S Brady; Andrea N Chavez; Chauncey R Gadek; Spencer C Galen; Andrew B Johnson; Xena M Mapel; Rosario A Marroquin-Flores; Taylor E Martinez; Jenna M McCullough; Jade E McLaughlin; Christopher C Witt
Journal: Oecologia Date: 2021-01-23 Impact factor: 3.225

4. Systematic improvement of amplicon marker gene methods for increased accuracy in microbiome studies.

Authors: Daryl M Gohl; Pajau Vangay; John Garbe; Allison MacLean; Adam Hauge; Aaron Becker; Trevor J Gould; Jonathan B Clayton; Timothy J Johnson; Ryan Hunter; Dan Knights; Kenneth B Beckman
Journal: Nat Biotechnol Date: 2016-07-25 Impact factor: 54.908

Review 5. Stochastic Community Assembly: Does It Matter in Microbial Ecology?

Authors: Jizhong Zhou; Daliang Ning
Journal: Microbiol Mol Biol Rev Date: 2017-10-11 Impact factor: 11.056

6. Accurate read-based metagenome characterization using a hierarchical suite of unique signatures.

Authors: Tracey Allen K Freitas; Po-E Li; Matthew B Scholz; Patrick S G Chain
Journal: Nucleic Acids Res Date: 2015-03-12 Impact factor: 16.971

7. Global diversity and biogeography of bacterial communities in wastewater treatment plants.

Authors: Linwei Wu; Daliang Ning; Bing Zhang; Yong Li; Ping Zhang; Xiaoyu Shan; Qiuting Zhang; Mathew Robert Brown; Zhenxin Li; Joy D Van Nostrand; Fangqiong Ling; Naijia Xiao; Ya Zhang; Julia Vierheilig; George F Wells; Yunfeng Yang; Ye Deng; Qichao Tu; Aijie Wang; Tong Zhang; Zhili He; Jurg Keller; Per H Nielsen; Pedro J J Alvarez; Craig S Criddle; Michael Wagner; James M Tiedje; Qiang He; Thomas P Curtis; David A Stahl; Lisa Alvarez-Cohen; Bruce E Rittmann; Xianghua Wen; Jizhong Zhou
Journal: Nat Microbiol Date: 2019-05-13 Impact factor: 17.745

8. Continental scale structuring of forest and soil diversity via functional traits.

Authors: Vanessa Buzzard; Sean T Michaletz; Ye Deng; Zhili He; Daliang Ning; Lina Shen; Qichao Tu; Joy D Van Nostrand; James W Voordeckers; Jianjun Wang; Michael D Weiser; Michael Kaspari; Robert B Waide; Jizhong Zhou; Brian J Enquist
Journal: Nat Ecol Evol Date: 2019-08-19 Impact factor: 19.100

9. Temporal Changes of Virus-Like Particle Abundance and Metagenomic Comparison of Viral Communities in Cropland and Prairie Soils.

Authors: Carolyn R Cornell; Ya Zhang; Joy D Van Nostrand; Pradeep Wagle; Xiangming Xiao; Jizhong Zhou
Journal: mSphere Date: 2021-06-02 Impact factor: 4.389

10. Denitrifying and diazotrophic community responses to artificial warming in permafrost and tallgrass prairie soils.

Authors: Christopher R Penton; Derek St Louis; Amanda Pham; James R Cole; Liyou Wu; Yiqi Luo; E A G Schuur; Jizhong Zhou; James M Tiedje
Journal: Front Microbiol Date: 2015-07-21 Impact factor: 5.640