Jacob Bien, Xiaohan Yan, Léo Simpson, Christian L. Müller.
Abstract
Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
Year: 2021 PMID: 34267244 PMCID: PMC8282688 DOI: 10.1038/s41598-021-93645-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Illustration of fixed-level and trac-based taxon aggregation. The trees represent the available taxonomic grouping of 16 base-level taxa at the leaves (here OTU or ASV). (A) Arithmetic aggregation of OTUs/ASVs to a fixed level (genus rank). All base-level taxon counts are summed up to the respective parent genus. (B) trac's flexible tree-based aggregation, in which the level of aggregation can vary across the tree (e.g., two OTUs/ASVs, two species, one genus, and one family). The aggregation is based on the geometric mean of OTU/ASV counts and is determined in a data-adaptive fashion with the goal of optimizing for the particular prediction task. (C) Summary statistics of standard trac-inferred aggregation levels on all seven regression tasks. The Data column denotes the respective regression scenario (study name and outcome of interest), n the number of samples, and p the number of base-level taxa (OTUs) in the data. The values in the taxonomic rank columns (Kingdom, Phylum, etc.) indicate the average number of taxa selected on that level by trac in the respective regression task. Averages are taken over ten random training/out-of-sample test data splits.
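The difference between panels (A) and (B) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the trac implementation: the count matrix, the genus assignment, and the pseudo-count used to handle zeros are all hypothetical, and the grouping is fixed by hand here rather than learned from the data.

```python
import numpy as np

# Hypothetical toy data: 3 samples x 4 base-level taxa (OTUs).
counts = np.array([
    [10, 0, 5, 1],
    [3, 7, 0, 2],
    [0, 4, 6, 8],
], dtype=float)

# Hypothetical genus label for each OTU (two genera, two OTUs each).
genus_of_otu = np.array([0, 0, 1, 1])


def sum_aggregate(counts, groups):
    """Fixed-level (arithmetic) aggregation: sum counts within each group."""
    n_groups = groups.max() + 1
    out = np.zeros((counts.shape[0], n_groups))
    for g in range(n_groups):
        out[:, g] = counts[:, groups == g].sum(axis=1)
    return out


def log_geometric_mean_aggregate(counts, groups, pseudo_count=1.0):
    """Log of the geometric mean within each group, i.e. the mean of log counts.

    The pseudo-count is an assumption made here so that zero counts
    have a finite logarithm.
    """
    log_counts = np.log(counts + pseudo_count)
    n_groups = groups.max() + 1
    out = np.zeros((counts.shape[0], n_groups))
    for g in range(n_groups):
        out[:, g] = log_counts[:, groups == g].mean(axis=1)
    return out


genus_sums = sum_aggregate(counts, genus_of_otu)
genus_log_gm = log_geometric_mean_aggregate(counts, genus_of_otu)
```

Working with the mean of log counts keeps the aggregated feature on the log scale used by log-contrast regression, whereas summation stays on the count scale.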
Figure 2. Overview of trac aggregation and model selection with standard weighting on the sCD14 data. (A) Varying the trac regularization parameter λ produces a solution (aggregation) path. Each colored line corresponds to a distinct taxon, showing its coefficient value as the tuning parameter increases. The larger λ is, the more coefficients are set to 0, leading to a more parsimonious model. The dotted and dashed vertical lines mark the λ values selected by the CV best and 1SE rule, respectively. (B) Illustration of the cross-validation (CV) procedure. Mean (and standard error) CV error vs. λ path, with selected values at best CV error (dotted vertical line) or with the 1SE rule (dashed vertical line). (C) The actual vs. predicted values of sCD14 on the test set (1SE rule in red, CV best in blue). The Pearson correlation of trac predictions on the test set is 0.37 with the CV best solution and 0.23 with the CV 1SE rule. (D) Error on the test set vs. number of selected aggregations. (E) The trac model selected with the 1SE rule comprises five taxa across four levels, listed in the bottom table (see Fig. 3A for tree visualization of the aggregations). The coefficient column gives the nonzero coefficient values, which are in the same units as the sCD14 response variable.
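The two selection rules named in panels (A) and (B) can be made concrete with a small sketch. It assumes the common definition of the 1SE rule (the largest regularization value whose mean CV error is within one standard error of the minimum); the function name `select_lambda` and the toy error matrix are hypothetical and not taken from the trac software.

```python
import numpy as np


def select_lambda(lambdas, cv_errors):
    """Return (index of CV-best lambda, index of 1SE-rule lambda).

    cv_errors has shape (n_folds, n_lambdas): one error per fold and lambda.
    """
    lambdas = np.asarray(lambdas, dtype=float)
    cv_errors = np.asarray(cv_errors, dtype=float)
    n_folds = cv_errors.shape[0]

    mean_err = cv_errors.mean(axis=0)
    se_err = cv_errors.std(axis=0, ddof=1) / np.sqrt(n_folds)

    # CV best: lambda with the smallest mean CV error.
    i_best = int(np.argmin(mean_err))

    # 1SE rule: largest lambda (most parsimonious model) whose mean error
    # stays within one standard error of the minimum.
    threshold = mean_err[i_best] + se_err[i_best]
    eligible = np.flatnonzero(mean_err <= threshold)
    i_1se = int(eligible[np.argmax(lambdas[eligible])])
    return i_best, i_1se


# Hypothetical CV errors for 3 folds and 4 lambda values.
lambdas = [0.01, 0.1, 1.0, 10.0]
cv_errors = [
    [1.5, 0.8, 1.0, 2.0],
    [1.4, 1.0, 1.1, 2.2],
    [1.6, 1.2, 1.2, 2.1],
]
i_best, i_1se = select_lambda(lambdas, cv_errors)
```

Because the 1SE rule trades a small increase in CV error for a sparser model, it may predict somewhat worse on held-out data, which is consistent with the lower test-set correlation reported for the 1SE solution in panel (C).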
Figure 3. Taxonomic tree visualization of trac aggregations in four selected scenarios using sCD14 data (training/test split 1). Each tree represents the taxonomy of the OTUs. Colored branches highlight the estimated trac taxon aggregations. The black dots mark the selected taxa of the respective sparse log-contrast model. The outer rim represents the values of the coefficients in the trac model from Eq. (1). (A) Standard trac (a = 1) with OTUs as taxon base level selects five aggregations. (B) Weighted trac (a = 1/2) with OTU base level selects eleven aggregations, including six on the OTU level. Four of these OTUs were also selected by the sparse log-contrast model, which comprises nine OTUs in total (black dots) (see Suppl. Tables 6 and 7 for the selected coefficients). (C) Standard trac (a = 1) with family base level selects three aggregations. (D) Weighted trac (a = 1/2) with family as taxon base level selects five aggregations, including one family (Enterobacteriaceae) shared with the sparse log-contrast model when also applied at the family base level (see Suppl. Table 10 for the six selected families).
Average out-of-sample test errors (rounded average model sparsity in parentheses) for trac (a = 1 and a = 1/2) and sparse log-contrast models, respectively. Each row considers a different base level (OTU, genus, and family). Each number is averaged over ten different training/test splits of the sCD14 data.
| Base level | Number of base-level taxa (p) | trac (a = 1) | trac (a = 1/2) | Sparse log-contrast |
|---|---|---|---|---|
| OTU | 539 | 6.3e | 6.7e | 6.8e |
| Genus | 282 | 6.8e | 7.1e | 7.1e |
| Family | 112 | 6.5e | 6.5e | 6.6e |
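The evaluation protocol behind the table above (averaging test error over ten training/test splits) can be sketched as follows. The split proportion, the mean squared error metric, and the least-squares stand-in model are assumptions made for illustration only; the source does not specify them here, and `average_test_error` is a hypothetical helper.

```python
import numpy as np


def average_test_error(X, y, fit, predict, n_splits=10, test_fraction=1 / 3, seed=0):
    """Average mean squared test error over random training/test splits.

    test_fraction and the squared-error metric are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_fraction * n))
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        model = fit(X[train_idx], y[train_idx])
        residuals = y[test_idx] - predict(model, X[test_idx])
        errors.append(float(np.mean(residuals ** 2)))
    return float(np.mean(errors)), errors


# Stand-in model: ordinary least squares (not trac or sparse log-contrast).
def fit_ols(X_train, y_train):
    return np.linalg.lstsq(X_train, y_train, rcond=None)[0]


def predict_ols(coef, X_test):
    return X_test @ coef


# Hypothetical noise-free data, so the stand-in model recovers y exactly.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5])
mean_error, split_errors = average_test_error(X, y, fit_ols, predict_ols)
```

Any model exposing a fit and a predict step could be substituted for the least-squares stand-in, with the same splits reused across models so that the averaged errors are comparable.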
Figure 4. Taxonomic tree visualization of trac aggregations (a = 1 and a = 1/2) using the Central Park soil data (training/test split 1). Each tree represents the taxonomy of the OTUs. Colored branches highlight the estimated trac taxon aggregations. The black dots mark the selected taxa of the sparse log-contrast model. The outer rim represents the values of the coefficients in the trac model from Eq. (1). (A) Standard trac (a = 1) with OTUs as taxon base level selects six aggregations. (B) Weighted trac (a = 1/2) with OTU base level selects 28 aggregations, including 13 on the OTU level. Four of these OTUs are also selected by the sparse log-contrast model, which comprises 21 OTUs in total (black dots) (see Suppl. Tables 15 and 16 for the selected coefficients). (C) The table lists the coefficients associated with Eq. (2) for the trac (a = 1) model corresponding to the tree shown in (A). These values are in the same units as the pH response variable.
Average out-of-sample test errors (rounded average model sparsity in parentheses) for trac (a = 1 and a = 1/2) and sparse log-contrast models, respectively. Each row represents the results for base level OTU, genus, and family. Each value is averaged over ten different training/test splits of the Central Park soil data.
| Base level | Number of base-level taxa (p) | trac (a = 1) | trac (a = 1/2) | Sparse log-contrast |
|---|---|---|---|---|
| OTU | 3379 | 0.40 (10) | 0.39 (18) | 0.39 (33) |
| Genus | 2779 | 0.40 (13) | 0.38 (22) | 0.39 (26) |
| Family | 1492 | 0.39 (10) | 0.39 (15) | 0.40 (29) |
Figure 5. Taxonomic tree visualization of trac aggregations (OTUs as taxon base level, a = 1 and a = 1/2) for salinity prediction using Tara data (training/test split 1). Each tree represents the taxonomy of the miTAG OTUs. Colored branches highlight the estimated trac taxon aggregations. The black dots mark the selected taxa of the sparse log-contrast model. The outer rim represents the values of the coefficients in the trac model from Eq. (1). (A) Standard trac (a = 1) selects four aggregations on the kingdom, phylum, and class level. (B) Weighted trac (a = 1/2) selects ten aggregations across all taxonomic ranks, including a single OTU (OTU520). This OTU is also selected by the sparse log-contrast model, which comprises nine OTUs in total (black dots) (see Suppl. Table 18 for the selected coefficients). Both trac models select the phylum Bacteroidetes and the Alphaproteobacteria class. (C) The table lists the coefficients associated with Eq. (2) for the trac (a = 1) model corresponding to the tree shown in (A). These values are in the same units as the salinity response variable.
Average out-of-sample test errors (rounded average model sparsity in parentheses) for trac (a = 1 and a = 1/2) and sparse log-contrast models, respectively. Each row represents the results for base level OTU, genus, and family and the corresponding dimensionality of the base level. Each value is averaged over ten different training/test splits of the Tara data.
| Base level | Number of base-level taxa (p) | trac (a = 1) | trac (a = 1/2) | Sparse log-contrast |
|---|---|---|---|---|
| OTU | 8916 | 2.1 (7) | 1.8 (14) | 1.3 (24) |
| Genus | 4220 | 2.0 (7) | 1.5 (14) | 1.4 (34) |
| Family | 1869 | 2.1 (6) | 1.7 (10) | 1.6 (13) |