Cameron Martino, James T Morton, Clarisse A Marotz, Luke R Thompson, Anupriya Tripathi, Rob Knight, Karsten Zengler.
Abstract
The central aims of many host or environmental microbiome studies are to elucidate factors associated with microbial community compositions and to relate microbial features to outcomes. However, these aims are often complicated by difficulties stemming from high dimensionality, non-normality, sparsity, and the compositional nature of microbiome data sets. A key tool in microbiome analysis is beta diversity, defined by the distances between microbial samples. Many different distance metrics have been proposed, all with varying discriminatory power on data with differing characteristics. Here, we propose a compositional beta diversity metric rooted in a centered log-ratio transformation and matrix completion called robust Aitchison PCA. We demonstrate the benefits of compositional transformations upstream of beta diversity calculations through simulations. Additionally, we demonstrate improved effect size, classification accuracy, and robustness to sequencing depth over the current methods on several decreased sample subsets of real microbiome data sets. Finally, we highlight the ability of this new beta diversity metric to retain the feature loadings linked to sample ordinations, revealing salient intercommunity niche feature importance.

IMPORTANCE By accounting for the sparse compositional nature of microbiome data sets, robust Aitchison PCA can yield high discriminatory power and salient feature ranking between microbial niches. The software to perform this analysis is available under an open-source license and can be obtained at https://github.com/biocore/DEICODE; additionally, a QIIME 2 plugin is provided to perform this analysis at https://library.qiime2.org/plugins/deicode/.
Keywords: compositional; computational biology; matrix completion; metagenomics; microbiome
Year: 2019 PMID: 30801021 PMCID: PMC6372836 DOI: 10.1128/mSystems.00016-19
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
FIG 1 Benchmarking the rclr preprocessing step. Toy example with simple 3-taxon community sampled over time (A). Distance calculated between the t = 1 community and subsequent communities demonstrates the robustness of Aitchison distance compared to Euclidean distance (B).
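The contrast between Aitchison and Euclidean distance can be illustrated with a minimal Python sketch. It does not reproduce the data in FIG 1: the three-taxon counts and the function names are hypothetical, and the Aitchison distance is taken here as the Euclidean distance between clr-transformed samples, which requires strictly positive values. The sketch shows one aspect of robustness only, namely that sequencing the same community ten times deeper leaves the Aitchison distance unchanged while the Euclidean distance grows.

```python
import numpy as np


def clr(x):
    """Centered log-ratio transform of a strictly positive vector."""
    log_x = np.log(np.asarray(x, dtype=float))
    return log_x - log_x.mean()


def aitchison_distance(x, y):
    """Euclidean distance between clr-transformed samples."""
    return float(np.linalg.norm(clr(x) - clr(y)))


def euclidean_distance(x, y):
    """Euclidean distance on the raw counts."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.linalg.norm(x - y))


# Hypothetical 3-taxon counts (illustrative only, not from the paper).
t1 = np.array([10.0, 20.0, 70.0])
t2 = np.array([20.0, 20.0, 60.0])
t2_deep = 10 * t2  # same proportions as t2, ten times the sequencing depth

print("Aitchison:", aitchison_distance(t1, t2), aitchison_distance(t1, t2_deep))
print("Euclidean:", euclidean_distance(t1, t2), euclidean_distance(t1, t2_deep))
```

The clr step removes any per-sample scaling factor, which is why the two Aitchison values agree.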
FIG 2 A general overview of the workflow. (A) A sparse, raw sequencing count table with samples on the y axis and features (i.e., OTUs, genes) on the x axis. (B) The data are preprocessed by a robust centered log ratio transform (rclr) on only the known (nonzero) values. (C) Matrix completion with a robust principal-component analysis (RPCA) that operates on only the observed values in the table resolves a loading by samples and by features. These loadings can be directly used for ordination (D), biclustering (E), and the identification of important taxa driving clustering in both the previous plots (F).
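The rclr step in panel B can be sketched in Python as follows. This is an illustrative reading of the caption and is not taken from the DEICODE code: each sample (row) is centered by the mean log of its nonzero entries, and zeros are left as missing values (NaN) for a later matrix-completion step to handle. The function name `rclr`, the NaN convention, and the example table are assumptions.

```python
import numpy as np


def rclr(counts):
    """Sketch of a robust clr: log-transform only the nonzero entries and
    center each sample (row) by the mean log of its observed values.
    Zeros are returned as NaN, i.e. treated as unobserved.
    Assumes every row has at least one nonzero count."""
    counts = np.asarray(counts, dtype=float)
    observed = counts > 0
    logs = np.full(counts.shape, np.nan)
    logs[observed] = np.log(counts[observed])
    # Mean of the observed logs per row (log of the geometric mean).
    row_means = np.nanmean(logs, axis=1, keepdims=True)
    return logs - row_means


# Hypothetical sparse count table: 2 samples (rows) x 3 features (columns).
table = np.array([[10, 0, 90],
                  [5, 5, 0]])
transformed = rclr(table)
print(transformed)
```

Observed entries in each row sum to zero after centering, and the zero counts stay missing instead of being replaced with a pseudocount.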
FIG 3 (A) KL-divergence (y axis) from the simulated base truth data, compared between RPCA output from raw count data and RPCA output from rclr-preprocessed data. (B and C) Comparison between RPCA ordination by PERMANOVA F-statistic (B) and KNN classifier accuracy (C). All comparisons are shown at sequencing depths from 1,000 to 10,000 reads per sample. (D and E) Comparison of positive- (D) and negative-control (E) simulation by biclustering (top) and RPCA ordination (bottom).
FIG 4 A case study of RPCA on real data sets; sponge (left; A, B, and E) and sleep apnea (right; C, D, and F). PERMANOVA F test statistic (y axis) (A and C) or KNN classifier accuracy (B and D) by subsamples of the data sets. Ordination plots between 70 samples total (left) and maximum number of samples (right) compared between RPCA (top), generalized weighted UniFrac (alpha = 1) (middle), and Bray-Curtis (bottom) (E and F). Sponge data set plotted between healthy (blue) and stressed (red) (E) along with sleep apnea data set plotted between air (blue) and IHH (red) (F).
FIG 5 A case study of RPCA feature loadings on real data sets; sponge (left; A and C) and sleep apnea (right; B and D). Heat maps of clr-transformed sOTU tables with samples sorted by metadata and features sorted by RPCA feature loadings (A and B). Absolute highest (middle) and lowest (bottom) feature loading sOTUs (top) plotted as log ratios (x axis) by sample loading PC1 (y axis) (C and D).