Philip J Schmidt, Ellen S Cameron, Kirsten M Müller, Monica B Emelko.
Abstract
Diversity analysis of amplicon sequencing data has mainly been limited to plug-in estimates calculated using normalized data to obtain a single value of an alpha diversity metric or a single point on a beta diversity ordination plot for each sample. As recognized for count data generated using classical microbiological methods, amplicon sequence read counts obtained from a sample are random data linked to source properties (e.g., proportional composition) by a probabilistic process. Thus, diversity analysis has focused on diversity exhibited in (normalized) samples rather than probabilistic inference about source diversity. This study applies fundamentals of statistical analysis for quantitative microbiology (e.g., microscopy, plating, and most probable number methods) to sample collection and processing procedures of amplicon sequencing methods to facilitate inference reflecting the probabilistic nature of such data and evaluation of uncertainty in diversity metrics. Following description of types of random error, mechanisms such as clustering of microorganisms in the source, differential analytical recovery during sample processing, and amplification are found to invalidate a multinomial relative abundance model. The zeros often abounding in amplicon sequencing data and their implications are addressed, and Bayesian analysis is applied to estimate the source Shannon index given unnormalized data (both simulated and experimental). Inference about source diversity is found to require knowledge of the exact number of unique variants in the source, which is practically unknowable due to library size limitations and the inability to differentiate zeros corresponding to variants that are actually absent in the source from zeros corresponding to variants that were merely not detected. 
Given these problems with estimation of diversity in the source even when the basic multinomial model is valid, diversity analysis at the level of samples with normalized library sizes is discussed.
Keywords: Markov chain Monte Carlo; Shannon index; amplicon sequencing; normalization; rarefying
Year: 2022 PMID: 35300475 PMCID: PMC8921663 DOI: 10.3389/fmicb.2022.728146
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Summary of random errors in amplicon sequencing and associated assumptions in the multinomial relative abundance model.
| Error source | Description of error and compatibility with multinomial model | Assumptions |
|---|---|---|
| Sample collection | The random sampling error describing variability in the number of discrete objects captured in a sample yields a Poisson distribution if microorganisms are randomly dispersed in a large source. This error is compatible with a multinomial model for proportional abundance of variants. Clustering, including multiple gene copies per organism, leads to excess variability that is incompatible with a multinomial model. | All microorganisms are randomly dispersed (i.e., not clustered) with only one gene copy each |
| Sample handling | The number of a particular type of microorganism may increase or decrease between sample collection and sample processing. Growth inflates the number of microorganisms at the level of diversity represented before growth occurred and is incompatible with a multinomial model. Decay is a form of random analytical error that is compatible with a multinomial model if it is consistent among variants. | No growth; no differential decay (analytical recovery) among variants |
| Sample processing | The number of gene sequences subjected to amplification may be lower than the number in the sample prior to processing due to losses (e.g., adherence to apparatus, not all genes extracted, sample partitioning). This is compatible with a multinomial model if analytical recovery is constant among variants. | No differential losses (analytical recovery) among variants |
| Amplification | The number of gene sequences is purposefully increased using polymerase chain reactions, inflating the number of gene sequences at the level of diversity represented before amplification occurred, and is incompatible with a multinomial model. Copy errors are a form of loss for the original sequences that were incorrectly copied and produce erroneous sequences that may then be further amplified. Erroneous sequences are incompatible with a multinomial model unless all of them are removed from the data. | Pre-amplification variant diversity is fully identical to source diversity and sequences are perfectly duplicated in each PCR cycle; no differential amplification efficiency or potential for copy errors among variants |
| Amplicon sequencing | Only a subsample of sequences is read, and all variants must be equally likely to be read. Sequence reading errors are a form of loss for the original sequences that were incorrectly read and also produce erroneous sequence reads. Sequence reading errors are incompatible with a multinomial model unless all resultant erroneous sequences are removed from the data. | No differential sequence reading errors among variants or differential losses; data denoising must remove all erroneous sequence reads and no legitimate reads |
Without these difficult assumptions, the multinomial model describes post-amplification variant diversity rather than source microbial diversity.
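The sample collection row of the table can be illustrated with a small simulation. The sketch below is not taken from the study. It assumes a hypothetical mean count and a hypothetical fixed cluster size, and compares the variance-to-mean ratio of counts for randomly dispersed single-copy organisms (Poisson) with that of organisms captured in clusters. A ratio near 1 is consistent with Poisson sampling error, while a ratio well above 1 indicates the excess variability that the table describes as incompatible with a multinomial model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values chosen only for illustration (not from the study).
mean_count = 20.0    # expected number of gene copies captured per sample
cluster_size = 5     # gene copies (or cells) that always travel together
n_samples = 100_000  # number of simulated samples

# Randomly dispersed, single-copy organisms: counts are Poisson distributed.
dispersed = rng.poisson(mean_count, size=n_samples)

# Clustered organisms: the number of clusters is Poisson distributed and each
# cluster contributes `cluster_size` copies, so the mean is unchanged but the
# variance is inflated.
clustered = cluster_size * rng.poisson(mean_count / cluster_size, size=n_samples)


def dispersion_index(counts):
    """Variance-to-mean ratio; approximately 1 for Poisson counts."""
    return counts.var() / counts.mean()


poisson_ratio = dispersion_index(dispersed)
clustered_ratio = dispersion_index(clustered)
print(f"dispersed: {poisson_ratio:.2f}, clustered: {clustered_ratio:.2f}")
```

With a fixed cluster size, the ratio for the clustered case is expected to be close to the cluster size itself. Real clustering would be more irregular than this simplified assumption.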
Figure 1. Box and whisker plot of Markov chain Monte Carlo (MCMC) samples from posterior distributions of the Shannon index based on analysis of simulated data (Supplementary Data Sheet 2). Data with various library sizes (Supplementary Table S1) were analyzed in each of three ways: with zeros excluded (not applicable in some cases), with zeros included for non-detected variants, and with extraneous zeros corresponding to variants that do not exist in the source. The true Shannon index of the source from which the data were simulated is 3.03.
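The effect of the three treatments of zeros can be sketched under the basic multinomial model. The example below is an illustration, not the study's analysis. It assumes a uniform Dirichlet prior (`prior_alpha=1.0`), uses made-up counts, and draws directly from the conjugate Dirichlet posterior instead of running MCMC. The function names are hypothetical, and the natural logarithm is assumed for the Shannon index.

```python
import numpy as np


def shannon(p):
    """Shannon index (natural log) along the last axis; 0 * log(0) counts as 0."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0, so zero proportions add nothing
    return -(p * np.log(safe)).sum(axis=-1)


def posterior_shannon(counts, prior_alpha=1.0, n_draws=5000, seed=0):
    """Draw Shannon index values from a Dirichlet posterior on the proportions.

    Assumes a multinomial likelihood and a symmetric Dirichlet(prior_alpha)
    prior; both are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(counts, dtype=float) + prior_alpha
    draws = rng.dirichlet(alpha, size=n_draws)  # shape: (n_draws, n_variants)
    return shannon(draws)


# Hypothetical read counts for eight detected variants (library size 100).
detected = np.array([40, 25, 15, 10, 5, 3, 1, 1])
with_nondetect_zeros = np.concatenate([detected, np.zeros(4, dtype=int)])
with_extraneous_zeros = np.concatenate([detected, np.zeros(54, dtype=int)])

for label, counts in [
    ("zeros excluded", detected),
    ("zeros for non-detected variants", with_nondetect_zeros),
    ("extraneous zeros", with_extraneous_zeros),
]:
    h = posterior_shannon(counts)
    print(label, np.percentile(h, [2.5, 50, 97.5]).round(2))
```

Under these assumptions, each additional zero row adds prior mass to a variant that was never observed, so the posterior for the Shannon index shifts upward as more zeros are included. This is why the assumed number of unique variants in the source matters for inference.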
Figure 2. Box and whisker plot of MCMC samples from posterior distributions of the Shannon index based on analysis of environmental amplicon sequencing data (Supplementary Data Sheet 2). Data with various library sizes between 10,000 and 30,000 were analyzed three ways: with zeros excluded, with zeros included in a 1,142-row ASV table (no zero rows), and with additional zeros from the full 3,342-row ASV table including variants with rows of zeros (detected in other samples from the same study area).
Figure 3. Representation of how library size and diversity quantified therefrom relate to uncertainty in statistical inference about source diversity and variability introduced by repeatedly rarefying to the smallest obtained library size. In this case, rarefying repeatedly evaluates the extent of the diversity (after amplification) exhibited if a library size of only n = 5,000 had been obtained from each sample.
Figure 4. Demonstration of normalization by rarefying repeatedly using simulated data. The box and whisker plot for the library size of 25 (*) illustrates how the Shannon index varies among simulated samples and is consistently below the actual Shannon index of 3.03 (red line). The Shannon index calculated from the samples with larger library sizes (red dots) deteriorates at small library sizes. The box and whisker plots for these library sizes illustrate what Shannon index might have been calculated if only a library size of 25 had been obtained (rarefying 1,000 times to this level). In all cases, a Shannon index of about 2.5 is expected with a library size of 25.
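Rarefying repeatedly, as in Figure 4, can be sketched as subsampling reads without replacement many times and recomputing the plug-in Shannon index each time. The example below is a minimal illustration under stated assumptions. The source composition (30 variants with proportions that decline as 1/rank), the library size of 5,000, and the function names are hypothetical, and the natural logarithm is assumed. Only the rarefying depth of 25 and the 1,000 repetitions follow the caption.

```python
import numpy as np


def shannon_from_counts(counts):
    """Plug-in Shannon index (natural log) from a vector of read counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())


def rarefy_repeatedly(counts, depth, n_repeats=1000, seed=0):
    """Subsample `depth` reads without replacement `n_repeats` times.

    Returns the plug-in Shannon index of each rarefied subsample.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    # Each row is one rarefied sample whose counts sum to `depth`.
    subsamples = rng.multivariate_hypergeometric(counts, depth, size=n_repeats)
    return np.array([shannon_from_counts(s) for s in subsamples])


# Hypothetical source: 30 variants with proportions declining as 1/rank.
rng = np.random.default_rng(42)
source_p = 1.0 / np.arange(1, 31)
source_p /= source_p.sum()

# One simulated library under the basic multinomial model.
library = rng.multinomial(5000, source_p)

rarefied = rarefy_repeatedly(library, depth=25)
print("full library:", round(shannon_from_counts(library), 2))
print("rarefied to 25 (quartiles):", np.percentile(rarefied, [25, 50, 75]).round(2))
```

The spread of `rarefied` shows the variability introduced by the subsampling itself. Its values cannot exceed ln(25), which is one way to see why the Shannon index calculated at a library size of 25 sits below the value for the source.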