| Literature DB >> 35369727 |
Qiyun Zhu1,2,3, Shi Huang3,4,5, Antonio Gonzalez3, Imran McGrath4,6, Daniel McDonald3, Niina Haiminen7, George Armstrong3,4,8, Yoshiki Vázquez-Baeza4, Julian Yu1,2, Justin Kuczynski9, Gregory D Sepich-Poore10, Austin D Swafford4, Promi Das3,11, Justin P Shaffer3, Franck Lejzerowicz3,4, Pedro Belda-Ferre3,4, Aki S Havulinna12,13, Guillaume Méric14,15, Teemu Niiranen12,16,17, Leo Lahti18, Veikko Salomaa12, Ho-Cheol Kim19, Mohit Jain20,21, Michael Inouye14,22, Jack A Gilbert3,4,11, Rob Knight3,10,23.
Abstract
We introduce the operational genomic unit (OGU) method, a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance, and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome data sets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project data set and more accurate prediction of human age by the gut microbiomes of Finnish individuals included in the FINRISK 2002 cohort. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate adoption of the OGU method in future metagenomics studies. IMPORTANCE Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. Current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution. To solve these challenges, we introduce operational genomic units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition and (ii) permitting use of phylogeny-aware tools. Our analysis of real-world data sets shows that it is advantageous over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGUs as an effective practice in metagenomic studies.Entities:
Keywords: UniFrac; metagenomics; operational genomic unit; reference phylogeny; supervised learning; taxonomy independent
Year: 2022 PMID: 35369727 PMCID: PMC9040630 DOI: 10.1128/msystems.00167-22
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 7.324
FIG 1Feature resolution impacts community ecology analysis in small conceptual examples. (A) Synthetic data set involving three microbial communities, each having 12 unique read hits, as represented by black circles in the frequency table, to a total of 10 reference genomes (OGUs), classified under five species, three genera, and one family, as noted on the left. A phylogenetic tree of the 10 genomes is shown on the right. In this simplified case, the phylogeny is not much more complex than the taxonomy (with three more edges); however, the taxonomic assignment and the phylogenetic placement of genome O5 are not consistent. (B) Beta diversity of the data set. The three samples (circles) are connected by edges representing the pairwise distances calculated by Bray-Curtis (BC) or weighted UniFrac (WU) on the frequency table. For the latter measure, either the taxonomy or the phylogeny was used to quantify the hierarchical relationships among OGUs, as noted in the parentheses. The edge lengths were normalized so that their sum is equal in each graph. This synthetic case study demonstrates that different resolutions of features and feature structures can lead to very different conclusions regarding sample relationships.
FIG 2Analysis of the HMP metagenomes reveals clustering by body environment and differentiation by host sex. The analysis was performed on 210 samples subsampled to one million paired-end shotgun reads each. (A) Heat map of the OGU table randomly subsampled to 10 samples and 100 OGUs. (B) Heat map of a species-level taxonomic profile inferred by Bracken, subsampled to the same 10 samples and 100 randomly selected species. Empty cells represent zero values. The feature count and table density indicated in the y axis describe the entire (not subsampled) table. (C) PCoA based on OGUs using weighted UniFrac calculated with the WoL reference phylogeny based on the OGU table. Samples (dots) are colored by body site and shaped by host sex. (D) PCoA using the currently adopted method (CAM): Bray-Curtis calculated on species-level taxonomic units identified by Bracken, which shows a diagonal pattern that aligns all samples of the four nonoral body sites in one plane. (E) Proportions of community structure variance explained by the first three axes of PCoA. (F) Mean ratio of the beta diversity distances from any oral sample to a sample of the two other oral sites versus to that of nonoral body sites. The lower the mean ratio is, the more similar communities of the three oral sites are to each other in the background of multiple body environments. The bold line in each box represents the median. The whiskers represent 1.5 times the interquartile range (IQR). (G and H) PERMANOVA pseudo-F statistics indicating the differentiation of community structures by body site (G) and by host sex (H). The larger F is, the more distinct the community structures are between groups versus within groups. The y axis is aligned to an F value of 1.0 which indicates no difference. For panel G, all statistics have a P value of 0.001. For H, an asterisk indicates a P value of ≤0.05. (I) Differentially abundant phylogenetic clades by host sex inferred using PhyloFactor and visualized using EMPress on the WoL reference phylogeny. The tree was subsetted to include only OGUs detected in the data set. The top 20 clades by effect size are colored (full details provided in Fig. S1 and S2). The top five clades are numbered 1 through 5 by decreasing effect size, circled, and labeled with corresponding taxonomic annotations. The small color ring represents phylum-level annotations. The inner and outer bar plot rings indicate the OGU counts split by body site (using the same color scheme as in panels A and B) and by host sex, respectively.
FIG 3Analysis of the FINRISK metagenomes showing superior prediction accuracy over taxonomic units and 16S rRNA data. (A) The empirical error (MAE; a lower error indicates better performance) in predicting host chronological age using microbiome features at distinct taxonomic ranks in paired 16S rRNA amplicon and shallow shotgun metagenomics data with a random forest regressor. “None” represents the taxonomy-free, finest-possible level (ASV for 16S, OGU for shotgun). Small circles indicate MAEs in all iterations of 5-fold cross validation. Large circles and error bars indicate means and standard deviations of the five MAEs. (B) Scatterplot of the actual age versus the predicted age by the best-performing model with OGU features in the 5-fold cross-validation. The black line was generated using ggplot2’s local polynomial regression fitting. (C) Phylogenomic tree of 169 OGUs with importance scores of ≥0.1 in the prediction model. The tree was subsampled based on the WoL reference phylogeny and drawn to scale (branch lengths represent mutations per site). Branch colors indicate the mean importance score of all descendants of the clade. Taxonomic labels are displayed where needed. Circles and lines with stops are displayed where needed to assist location of taxonomic labels to target branches or clades.
FIG 4Impact of alignment artifacts on the result of the OGU analysis. (A to D) Robustness of analysis results on the HMP data set. Four commonly used beta diversity measures were tested: Jaccard (unweighted, tree free), Bray-Curtis (weighted, tree free), unweighted UniFrac (tree aware) and weighted UniFrac (tree aware). (A and B) Impact of the maximum number (k) of alignments to consider for each query sequence. This k value corresponds to Bowtie2’s “-k” parameter. “All” instructs Bowtie2 to return all possible alignments. (C and D) Impact of the feature filtering threshold. The original OGU table was filtered to remove OGUs that were below the per-sample relative abundance threshold indicated at the x axis. “None” indicates no filtering (original table). (A and C) PERMANOVA pseudo-F statistics for the separation between samples from male and female hosts. (B and D) Mean ratios of the distances from any oral sample to samples of the two other oral sites versus to that of nonoral body sites. The error bars indicate the means and standard deviations of the ratios.
FIG 5Relationship between OGU taxonomy and community composition. (A to C) Presence of OGU-informed taxonomic units in Bracken-inferred taxonomic profiles using the HMP data set (n = 210). (A) Per-sample fraction of OGUs whose corresponding taxonomic units at different ranks were found by Bracken. (B) Presence/absence of corresponding species of all OGUs in all samples (n = 307,060) in the Bracken result. The y axis indicates the frequency of individual OGUs per sample (out of 1 million paired-end reads). Outliers are not displayed due to the sample size. The bold line is a logistic regression curve, with coefficient and intercept annotated. (C) Per-sample fraction of OGUs filtered by different relative abundance thresholds whose corresponding species were found by Bracken. (D to F) Evaluation of the accuracy of OGU or species assignments and abundance estimation using a simulated metagenomic data set (n = 25) by the area under the precision/recall curve (AUPR) (D), Bray-Curtis (E), and weighted UniFrac (F) against the ground truth. Three tools—SHOGUN, Bracken, and MIDAS—were evaluated. The bold line and the whiskers represent median and 1.5 times the IQR, respectively. Each dot represents a simulated metagenomic sample.
FIG 6Comparison of beta diversity analysis results on the HMP data set using alternative protocols. (A to D) Comparison of the OGU method with five taxonomic profilers. The OGU table was extracted from SHOGUN’s alignment file using Woltka and analyzed using weighted UniFrac with the WoL reference phylogeny. It was compared with species-level profiles analyzed using Bray-Curtis. (E to H) Comparison of six reference genome databases used for generating the OGU table. The six databases are ordered by their volume from small (left) to large (right). Each OGU table was analyzed using either Bray-Curtis or weighted UniFrac with the original phylogenetic tree provided in the database. (I to L) Comparison of three sequence aligners used for generating the OGU table. (A, E, and I) Proportions of variance explained by the first three axes of PCoA. (B, F, and J) PERMANOVA pseudo-F statistics for the separation between samples from male and female hosts. (C, G, and K) P values of the corresponding PERMANOVA tests. (D, H, and L) Mean ratios of the distances from any oral sample to samples of the two other oral sites versus to that of nonoral body sites. The bold line and the whiskers represent median and 1.5 times the IQR, respectively.
FIG 7Impact of sequencing depth on the result of the OGU analysis. The original OGU table generated from 1 million paired-end reads of the HMP data set was randomly subsampled to each of the sampling depths indicated at the x axis, which is equivalent to the number of paired-end reads in the original sequencing data (see Materials and Methods). (A) Pearson correlation coefficient (r) between the composition (frequencies) of the original OGU table and the subsampled tables. To ensure visual resolution of the boxes, three outliers of 1,000, 200 and 100, respectively, are not visible in the range of the y axis. (B) Pearson’s r between the ratios of distances from any oral sample to sample of the two other oral sites versus to that of nonoral body sites. (C and D) PERMANOVA pseudo-F statistics on body site (C) and host sex (D). The results presented in panels B, C, and D were calculated from 10 replicates of random subsampling at each depth. The dashed lines indicate the statistic calculated on the original table. In all panels, the bold line in each box represents the median. The whiskers represent 1.5 times the IQR. Note that the ranges of the y axes differ between panels.
FIG 8Impact of classification resolution and reference genome pool on the analysis results on the HMP data set. All feature tables were generated using the SHOGUN/Woltka protocol. (A to C) Comparison of community profiles classified to OGUs or each of the seven standard taxonomic ranks. (A) Feature counts and table densities of the profiles. (D to I) Comparison of OGU-level and species-level profiles generated using a gradient of sizes of the reference genome pool sampled from the same genome collection. (E) Numbers of genomes and species in the three genome pools: “half,” “core,” and “double.” Note that “core” is the original WoL database, and it is a superset of “half” and a subset of “double.” (I) Numbers of OGUs and species classified in the profiles. (B, E, and H) PERMANOVA pseudo-F statistics for the separation between samples from male and female hosts. (C, F, and I) P values of the corresponding PERMANOVA tests.
FIG 9Procedure for generating OGU tables, as implemented in Woltka. This program serves as an interface connecting upstream sequence alignment and downstream community diversity analysis. The inputs are sequence alignments, and the outputs are feature tables (in this case, OGU tables). The pipeline contains multiple procedures, with rich format support and flexible options to address diverse user needs.