Jizhong Zhou, Yi-Huei Jiang, Ye Deng, Zhou Shi, Benjamin Yamin Zhou, Kai Xue, Liyou Wu, Zhili He, Yunfeng Yang.
Abstract
The site-to-site variability in species composition, known as β-diversity, is crucial to understanding spatiotemporal patterns of species diversity and the mechanisms controlling community composition and structure. However, quantifying β-diversity in microbial ecology using sequencing-based technologies is very challenging because of the high number of sequencing errors, bias, and poor reproducibility and quantification. Herein, based on general sampling theory, a mathematical framework was first developed for simulating the effects of random sampling processes on quantifying β-diversity when the community size is known or unknown. Using an analogous ball example under Poisson sampling with limited sampling efforts, the framework exactly predicted the low reproducibility among technical replicate samples drawn from the same community with a given species abundance distribution, which provides explicit evidence that random sampling processes are the main factor causing the high percentages of technical variation. In addition, the values predicted under Poisson random sampling were highly consistent with the observed low percentages of operational taxonomic unit (OTU) overlap (<30% and <20% for two and three tags, respectively, based on both Jaccard and Bray-Curtis dissimilarity indexes), further supporting the hypothesis that the poor reproducibility among technical replicates is due to artifacts associated with random sampling processes. Finally, a mathematical framework was developed for predicting the sampling effort needed to achieve a desired overlap among replicate samples. Our modeling simulations predict that several orders of magnitude more sequencing effort is needed to achieve the desired high technical reproducibility. These results suggest that great caution needs to be taken in quantifying and interpreting β-diversity for microbial community analysis using next-generation sequencing technologies.
IMPORTANCE Due to the vast diversity and uncultivated status of the majority of microorganisms, microbial detection, characterization, and quantitation are very challenging. Although large-scale metagenome sequencing technology such as PCR-based amplicon sequencing has revolutionized the study of microbial communities, it suffers from several inherent drawbacks, such as a high number of sequencing errors, biases, poor quantitation, and very high percentages of technical variation, which could lead to a great overestimation of microbial biodiversity. Based on general sampling theory, this study provides the first explicit evidence of the importance of random sampling processes in estimating microbial β-diversity, which has not been adequately recognized and addressed in microbial ecology. Since most ecological studies involve random sampling, the conclusions drawn from this study should also be applicable to other ecological studies in general. In summary, the results presented in this study should have important implications for examining microbial biodiversity to address both basic theoretical and applied management questions.
Year: 2013 PMID: 23760464 PMCID: PMC3684833 DOI: 10.1128/mBio.00324-13
Source DB: PubMed Journal: MBio Impact factor: 7.867
FIG 1 An analogous example to simulate random sampling processes. Three identical jars contain the same number and types of balls, with identical ball abundance distribution.
Chi-square-based goodness-of-fit test of the observed and predicted percentages of overlap for the analogous example
| OTU abundance | χ² (two samples, N known) | P (two samples, N known) | χ² (two samples, N unknown) | P (two samples, N unknown) | χ² (three samples, N known) | P (three samples, N known) | χ² (three samples, N unknown) | P (three samples, N unknown) |
|---|---|---|---|---|---|---|---|---|
| Exponential | 6.6 × 10⁻⁵ | 0.999 | 6.6 × 10⁻⁵ | 0.999 | 2.7 × 10⁻⁴ | 0.999 | 2.7 × 10⁻⁴ | 0.999 |
| Gamma | 1.9 × 10⁻⁶ | 0.999 | 5.1 × 10⁻⁴ | 0.999 | 9.4 × 10⁻⁶ | 0.999 | 6.5 × 10⁻³ | 0.999 |
| Lognormal | 4.2 × 10⁻⁴ | 0.999 | 2.1 × 10⁻⁴ | 0.999 | 1.1 × 10⁻³ | 0.999 | 1.1 × 10⁻³ | 0.999 |
| Inverse gamma | 4.0 × 10⁻⁶ | 0.999 | 2.1 × 10⁻⁴ | 0.999 | 6.5 × 10⁻⁶ | 0.999 | 2.2 × 10⁻³ | 0.999 |
| Inverse Gaussian | 3.3 × 10⁻⁵ | 0.999 | 9.1 × 10⁻⁴ | 0.999 | 7.5 × 10⁻⁵ | 0.999 | 8.5 × 10⁻⁴ | 0.999 |
Detailed information is presented in Fig. S2 and S3 in the supplemental material.
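
As a rough illustration of the kind of statistic reported in this table, the sketch below computes Pearson's chi-square between observed and predicted overlaps. It assumes the comparison is made point by point across sampling efforts, which this record does not spell out; the function name and the input values are hypothetical, and the P value (which would come from the chi-square distribution with the appropriate degrees of freedom) is not computed here.

```python
def pearson_chi_square(observed, predicted):
    """Pearson's statistic: sum of (observed - predicted)^2 / predicted.

    Assumes every predicted value is strictly positive.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, predicted))


# Hypothetical overlap proportions, NOT data from the paper.
observed = [0.10, 0.20, 0.30]
predicted = [0.11, 0.19, 0.31]
print(pearson_chi_square(observed, predicted))
```

A statistic close to zero, as in the table, indicates that the observed overlaps barely deviate from the predicted ones.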
FIG 2 The relationships between the expected Jaccard overlaps of ball colors and sampling efforts under the exponential abundance distribution, assuming the community has 10⁶ individual balls and 10⁴ types of balls, with different colors. Distribution parameter is set to λ = 1 × 10⁻². In each case, we calculated the theoretically predicted overlap (blue line) by equation 8 when N is known, the predicted overlap (red line) by equation 10 when N is unknown, and the average observed overlap (point) through simulations of 100 repeated samplings. (A) Two samples. The sample ratio is a1 = a2. (B) Three samples. The sample ratio is a1 = a2 = a3.
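
The two-sample setup in this caption can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's equations 8 or 10: abundances are drawn from an exponential distribution with rate λ = 1 × 10⁻² (mean 100, so 10⁴ types give roughly 10⁶ individuals), a type with abundance n is detected under Poisson sampling with probability 1 − exp(−effort · n), and the predicted overlap is a simple ratio of expected shared types to expected total types. All function names and the `effort` parameter (expected draws per individual) are hypothetical.

```python
import math
import random


def make_community(n_types=10_000, rate=1e-2, seed=0):
    """Draw one abundance per ball type from an exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n_types)]


def detection_probs(abundances, effort):
    """P(type seen at least once) when its count is Poisson(effort * n)."""
    return [-math.expm1(-effort * n) for n in abundances]


def expected_jaccard(probs):
    """Ratio of expected shared types to expected types seen in either sample."""
    shared = sum(p * p for p in probs)
    union = sum(1.0 - (1.0 - p) ** 2 for p in probs)
    return shared / union


def simulated_jaccard(probs, rng, repeats=100):
    """Average Jaccard overlap of two independent samples over repeated draws."""
    total = 0.0
    for _ in range(repeats):
        shared = union = 0
        for p in probs:
            in_a = rng.random() < p
            in_b = rng.random() < p
            shared += int(in_a and in_b)
            union += int(in_a or in_b)
        total += shared / union if union else 0.0
    return total / repeats


if __name__ == "__main__":
    community = make_community()
    rng = random.Random(1)
    for effort in (0.001, 0.01, 0.1):
        probs = detection_probs(community, effort)
        print(effort, expected_jaccard(probs), simulated_jaccard(probs, rng, repeats=20))
```

Only presence or absence matters for the Jaccard overlap, which is why the sketch simulates detection events directly instead of drawing full Poisson counts.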
Chi-square-based goodness-of-fit test of the observed and predicted percentages of overlap for the experimental data
| Communities | Similarity | χ² (forward primer) | P (forward primer) | χ² (reverse primer) | P (reverse primer) | χ² (combined) | P (combined) |
|---|---|---|---|---|---|---|---|
| Two tags | Jaccard | 0.040 | 0.999 | 0.060 | 0.999 | 0.038 | 0.999 |
| Two tags | Bray-Curtis | 0.106 | 0.999 | 0.122 | 0.999 | 0.075 | 0.999 |
| Three tags | Jaccard | 0.006 | 0.999 | 0.010 | 0.999 | 0.005 | 0.999 |
| Three tags | Bray-Curtis | 0.008 | 0.999 | 0.026 | 0.999 | 0.016 | 0.999 |
Detailed data are listed in Table S4A (two tags, Jaccard), S4B (two tags, Bray-Curtis), S5A (three tags, Jaccard), and S5B (three tags, Bray-Curtis) in the supplemental material.
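
For readers unfamiliar with the two indexes, the sketch below shows one common way to compute OTU overlap between tagged replicates from per-OTU read counts. It assumes overlap is the complement of the corresponding dissimilarity; the multi-sample Jaccard generalization (OTUs found in all tags divided by OTUs found in any tag) and all names and counts are illustrative assumptions, not taken from the paper.

```python
def jaccard_overlap(*samples):
    """Fraction of OTUs detected in every sample among OTUs detected in any."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    present = [{otu for otu, count in s.items() if count > 0} for s in samples]
    union = set.union(*present)
    if not union:
        return 0.0
    return len(set.intersection(*present)) / len(union)


def bray_curtis_similarity(x, y):
    """1 - Bray-Curtis dissimilarity for two samples of nonnegative counts."""
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    shared = sum(min(x.get(otu, 0), y.get(otu, 0)) for otu in set(x) | set(y))
    return 2.0 * shared / total


# Hypothetical read counts per OTU for two tags of the same sample.
tag_a = {"otu1": 5, "otu2": 3, "otu3": 0}
tag_b = {"otu1": 2, "otu3": 4, "otu4": 1}
print(jaccard_overlap(tag_a, tag_b))
print(bray_curtis_similarity(tag_a, tag_b))
```

Jaccard uses only presence or absence, while Bray-Curtis weights each OTU by its read count, so the two can diverge when rare OTUs dominate the disagreement.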
FIG 3 Prediction of sampling efforts for desired OTU overlap. (A) Desired overlap between two tags based on the combined sequences from sample 2UC. The sampling efforts were calculated based on equation 15. The parameters for species abundance distribution were from Table S3A in the supplemental material. (B) Desired overlap among three tags based on the combined sequences from sample 1UC. The sampling efforts were calculated based on equation 16. The parameters for species abundance distribution were from Table S3B.
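
The idea of predicting the effort needed for a desired overlap can be sketched by inverting a predicted-overlap curve numerically. The code below is an illustration under assumptions, not the paper's equations 15 or 16: it uses a deterministic exponential community (midpoint quantiles, rate 1 × 10⁻²), a ratio-of-expectations Jaccard overlap for two Poisson samples, and bisection on a log scale. The bisection assumes the overlap increases with effort across the bracket, which may not hold for every abundance distribution; all names are hypothetical.

```python
import math


def exponential_quantile_community(n_types=10_000, rate=1e-2):
    """Deterministic abundances at midpoint quantiles of an exponential law."""
    return [-math.log(1.0 - (i + 0.5) / n_types) / rate for i in range(n_types)]


def expected_overlap(abundances, effort):
    """Expected shared types over expected total types for two samples."""
    shared = union = 0.0
    for n in abundances:
        p = -math.expm1(-effort * n)  # P(type detected) under Poisson sampling
        shared += p * p
        union += 1.0 - (1.0 - p) ** 2
    return shared / union


def effort_for_overlap(abundances, target, lo=1e-9, hi=1.0, steps=60):
    """Smallest effort in [lo, hi] whose predicted overlap reaches target."""
    if not expected_overlap(abundances, lo) < target <= expected_overlap(abundances, hi):
        raise ValueError("target overlap is not bracketed by [lo, hi]")
    for _ in range(steps):
        mid = math.sqrt(lo * hi)  # geometric midpoint: effort spans many decades
        if expected_overlap(abundances, mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


community = exponential_quantile_community()
for target in (0.3, 0.5, 0.9):
    print(target, effort_for_overlap(community, target))
```

Here `effort` is the expected number of draws per individual, so multiplying it by the community size gives an approximate number of sequences.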