Cameron Martino, Daniel McDonald, Kalen Cantrell, Amanda Hazel Dilmore, Yoshiki Vázquez-Baeza, Liat Shenhav, Justin P Shaffer, Gibraan Rahman, George Armstrong, Celeste Allaband, Se Jin Song, Rob Knight.
Abstract
Microbiome data have several specific characteristics (sparsity and compositionality) that introduce challenges in data analysis. Integrating prior information about the data structure, such as phylogenetic structure and repeated-measure study designs, into the analysis is an effective approach for revealing robust patterns in microbiome data. Past methods have addressed some but not all of these challenges and features: for example, robust principal-component analysis (RPCA) addresses sparsity and compositionality; compositional tensor factorization (CTF) addresses sparsity, compositionality, and repeated-measure study designs; and UniFrac incorporates phylogenetic information. Here we introduce a strategy for incorporating phylogenetic information into RPCA and CTF. The resulting methods, phylo-RPCA and phylo-CTF, provide substantial improvements over state-of-the-art methods in the power to discriminate underlying clusters, in settings ranging from mode of delivery to adult human lifestyle. We demonstrate quantitatively that the addition of phylogenetic information improves effect size and classification accuracy in both data-driven simulated data and real microbiome data.
IMPORTANCE Microbiome data analysis can be difficult because of particular data features, some unavoidable and some due to technical limitations of DNA sequencing instruments. The first step in many analyses that ultimately reveal patterns of similarities and differences among sets of samples (e.g., separating samples from sick and healthy people or samples from seawater versus soil) is calculating the difference between each pair of samples. We introduce two new methods to calculate these differences that combine features of past methods. Specifically, they take into account the principles that most types of microbes are not in most samples (sparsity), that abundances are relative rather than absolute (compositionality), and that all microbes have a shared evolutionary history (phylogeny). We show, using simulated and real data, that our new methods provide improved classification accuracy of ordinal sample clusters and increased effect size between sample groups on beta-diversity distances.
Keywords: beta-diversity; compositional data analysis; phylogenetics
Year: 2022 PMID: 35477286 PMCID: PMC9238373 DOI: 10.1128/msystems.00050-22
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 7.324
FIG 1. Overview of the algorithm underlying phylo-RPCA and phylo-CTF. The input is a table of count data and a phylogeny relating the features of the table (A). First, the table is expanded to represent all nodes up to the root of the phylogeny by summing the counts beneath each node (B). Second, the closure of the expanded table is multiplied by the branch lengths, following Hamady 2010 (13); the data are then transformed with the rclr (C), and RPCA is performed. The output is a phylogenetic biplot in which the arrows represent both leaves and internal nodes of the input phylogeny (D) and whose directions can inform log-ratios of aggregated leaf counts (E).
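The preprocessing steps named in the FIG 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy tree, the `node_leaves` and `branch_lengths` arrays, and all function names are hypothetical, the root is left out of the toy node list for simplicity, and the final RPCA step (a low-rank fit to the observed entries) is not shown.

```python
import numpy as np

def expand_table(counts, node_leaves):
    """Sum the leaf counts beneath each node (samples x nodes)."""
    return np.column_stack([counts[:, idx].sum(axis=1) for idx in node_leaves])

def closure(x):
    """Scale each sample (row) so that its entries sum to 1."""
    return x / x.sum(axis=1, keepdims=True)

def rclr(x):
    """Robust clr: log of the nonzero entries, centered per sample on the
    mean log of the observed entries. Zeros are left as NaN (missing)."""
    logx = np.full(x.shape, np.nan)
    mask = x > 0
    logx[mask] = np.log(x[mask])
    return logx - np.nanmean(logx, axis=1, keepdims=True)

# Hypothetical toy data: 3 samples x 4 leaves (a, b, c, d).
counts = np.array([[10, 0, 5, 5],
                   [0, 4, 0, 16],
                   [3, 3, 3, 3]])

# Hypothetical tree: four leaves plus internal nodes (a, b) and (c, d).
node_leaves = [[0], [1], [2], [3], [0, 1], [2, 3]]
branch_lengths = np.array([0.1, 0.2, 0.3, 0.1, 0.5, 0.4])

expanded = expand_table(counts, node_leaves)     # step B: add internal nodes
weighted = closure(expanded) * branch_lengths    # closure, then branch weights
transformed = rclr(weighted)                     # step C: rclr transform
```

Because zeros stay missing rather than being replaced by a pseudocount, the transformed table is only partially observed, which is why the caption follows the rclr with RPCA rather than an ordinary PCA.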
FIG 2. As the phylogeny becomes more synchronized with the sample clusters, the additional benefit of phylogenetic information in RPCA increases. A data-driven simulation of shotgun microbiome data for three sample groups, based on EMP500 data, with sequencing depth reduced across plots from 2,000,000 to 200 reads (A). Comparison of phylogenetic RPCA sample clustering with a randomly generated tree and as an increasing percentage of the tips of the phylogenetic tree, which originally perfectly represents the features clustering the samples, is randomly shuffled 10-fold (B). Comparison across simulation read depth (colors from low to high) and phylogenetic-feature-sample cluster synchrony (x axis) for the PERMANOVA F-statistic (left), area under the precision-recall curve (PR-AUC, middle), and area under the receiver operating characteristic curve (right) (C).
FIG 3. Phylogeny improves discriminatory power in cross-sectional data and in repeated-measure data compared to existing methods. Comparison of phylogenetic RPCA/CTF (green) against the nonphylogenetic version (light green), Aitchison PCA (blue), Jaccard (orange), phylogenetically informed unweighted UniFrac, and generalized UniFrac with alpha varying the level of abundance weighting (colored in reds from least to most weighted by abundance). Methods are compared by PERMANOVA F-statistic on beta-diversity distances (left column) and by 10-fold cross-validated KNN classification, evaluated through the area under the precision-recall curve (right column). Cross-sectional comparison of hand skin bacterial communities from McCall et al. across villages representing an urbanization gradient from Peru to Brazil (A). Repeated-measure comparison of fecal bacterial communities from the ECAM data set across age and by birth mode (B).
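The PERMANOVA F-statistic used as the effect-size measure in these comparisons can be computed directly from a distance matrix and group labels. The sketch below uses the standard pseudo-F definition (among-group over within-group sums of squared distances) and is only an illustration: the function name and toy data are hypothetical, and the permutation step that yields a p-value is omitted.

```python
import numpy as np

def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    n = dist.shape[0]
    labels = np.unique(groups)
    a = len(labels)

    # Total sum of squares: squared distances over all pairs, divided by n.
    upper = np.triu_indices(n, k=1)
    ss_total = (dist[upper] ** 2).sum() / n

    # Within-group sum of squares: the same quantity inside each group.
    ss_within = 0.0
    for g in labels:
        idx = np.where(groups == g)[0]
        sub = dist[np.ix_(idx, idx)]
        ss_within += (np.triu(sub, k=1) ** 2).sum() / len(idx)

    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))

# Hypothetical toy data: four samples on a line, two well-separated groups.
points = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(points[:, None] - points[None, :])
groups = np.array(["A", "A", "B", "B"])

f_stat = pseudo_f(dist, groups)  # 200.0 for this toy example
```

For Euclidean distances this reduces to the classical one-way ANOVA F, so a larger value means that groups are more separated relative to their internal spread.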
FIG 4. Phylo-CTF and phylo-RPCA resolve ordinal and phylogenetically aggregated log-ratios in birth mode (top) and westernization gradients by village (bottom), respectively. Phylo-CTF ordination PC1 (y axis) colored by birth mode (A), bacterial and archaeal phylogeny colored by the PC1 feature loadings that also separate the respective sample groups for phylo-CTF (B), and log-ratio of high-value loadings (numerator, marked by a purple dot in the phylogeny) to low-value loadings (denominator, marked by a green dot in the phylogeny) identified in the respective phylogenies and sample groupings for phylo-CTF (C). Phylo-RPCA PC1 (x axis) and PC2 (y axis) colored by village across the urbanization gradient (D), phylogeny colored by PC1 feature loadings (E), and log-ratio of high-value loadings (numerator, marked by a purple dot in the phylogeny) to low-value loadings (denominator, marked by a green dot in the phylogeny) identified in the respective phylogenies and sample groupings for phylo-RPCA (F).