Gregory B Gloor, Jean M Macklaim, Vera Pawlowsky-Glahn, Juan J Egozcue.
Abstract
Datasets collected by high-throughput sequencing (HTS) of 16S rRNA gene amplimers, metagenomes or metatranscriptomes are commonplace and are being used to study human disease states, ecological differences between sites, and the built environment. There is increasing awareness that microbiome datasets generated by HTS are compositional because they have an arbitrary total imposed by the instrument. However, many investigators are either unaware of this or assume specific properties of the compositional data. The purpose of this review is to alert investigators to the dangers inherent in ignoring the compositional nature of the data, and to point out that HTS datasets derived from microbiome studies can and should be treated as compositions at all stages of analysis. We briefly introduce compositional data, illustrate the pathologies that occur when compositional data are analyzed inappropriately, and finally give guidance and point to resources and examples for the analysis of microbiome datasets using compositional data analysis.
Keywords: Bayesian estimation; compositional data; correlation; count normalization; high-throughput sequencing; microbiota; relative abundance
Year: 2017 PMID: 29187837 PMCID: PMC5695134 DOI: 10.3389/fmicb.2017.02224
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Figure 1. High-throughput sequencing data are compositional. (A) illustrates that the data observed after sequencing a set of nucleic acids from a bacterial population cannot inform on the absolute abundance of molecules. The number of counts in a high-throughput sequencing (HTS) dataset reflects the proportion of counts per feature (OTU, gene, etc.) per sample, multiplied by the sequencing depth. Therefore, only the relative abundances are available. The bar plots in (B) show the difference between the count of molecules and the proportion of molecules for two features, A (red) and B (gray), in three samples. The top bar graphs show the total counts for three samples, and the height of the color illustrates the total count of the feature. When the three samples are sequenced we lose the absolute count information and only have relative abundances, proportions, or “normalized counts” as shown in the bottom bar graph. Note that features A and B in samples 2 and 3 appear with the same relative abundances, even though the counts in the environment are different. The table in (C) shows real and perceived changes for each sample if we transition from one sample to another.
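The loss of absolute count information described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration only: the counts for features A and B are hypothetical numbers chosen for the example, not values read from the figure, and `closure` is a helper name introduced here.

```python
def closure(counts):
    """Rescale a list of counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute counts of features [A, B] in the environment.
# These are illustrative values, not data from the figure.
environment = {
    "sample1": [10, 10],
    "sample2": [20, 10],
    "sample3": [40, 20],
}

# Sequencing only reports each feature's share of the total,
# so the absolute scale of each sample is discarded.
proportions = {name: closure(c) for name, c in environment.items()}

for name, p in proportions.items():
    print(name, [round(v, 3) for v in p])
```

In this sketch, samples 2 and 3 differ twofold in absolute counts but yield the same proportions, which is the situation the caption describes for panel (B).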
Figure 2. The standard microbiome analysis tool kit and the compositional replacements. A simplified standard microbiome computational workflow is illustrated. The initial normalization steps are not formally equivalent since compositional data are inherently “normalized”, and read count normalization is unnecessary. The other steps are functionally equivalent and substitute a compositionally appropriate approach for one that is not.
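The statement in the Figure 2 caption that read count normalization is unnecessary can be illustrated with a log-ratio transform. The sketch below uses the centered log-ratio (clr), a standard transform in compositional data analysis. That choice is an assumption made for illustration, since this excerpt does not name a specific transform. The read counts are hypothetical, and zero counts, which would need separate handling before taking logarithms, are deliberately avoided.

```python
import math

def clr(counts):
    """Centered log-ratio: log of each part minus the mean of the logs,
    i.e. the log of each part over the geometric mean of all parts.
    Assumes every count is strictly positive."""
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical read counts for three features in the same sample,
# sequenced at two different depths (100 and 1000 reads).
shallow = [20, 10, 70]
deep = [200, 100, 700]

clr_shallow = clr(shallow)
clr_deep = clr(deep)

# The transform depends only on ratios between parts, so the
# tenfold difference in sequencing depth cancels out.
for a, b in zip(clr_shallow, clr_deep):
    print(round(a, 6), round(b, 6))
```

Because the two transformed vectors agree up to floating-point error, no separate depth normalization step is needed before comparing them, under the assumptions stated above.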