Sumaiya Nazeen, Yun William Yu, Bonnie Berger.
Abstract
Microbial populations exhibit functional changes in response to different ambient environments. Although whole metagenome sequencing promises enough raw data to study those changes, existing tools are limited in their ability to directly compare microbial metabolic function across samples and studies. We introduce Carnelian, an end-to-end pipeline for metabolic functional profiling uniquely suited to finding functional trends across diverse datasets. Carnelian finds shared metabolic pathways and concordant functional dysbioses, and distinguishes Enzyme Commission (EC) terms missed by existing methodologies. We demonstrate Carnelian's effectiveness on type 2 diabetes, Crohn's disease, Parkinson's disease, and industrialized and non-industrialized gut microbiome cohorts.
Keywords: Alignment-free binning; Comparative functional metagenomics; Compositional gapped binning; Functional profiling; Metagenomic binning
Year: 2020 PMID: 32093762 PMCID: PMC7038607 DOI: 10.1186/s13059-020-1933-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Comparative functional metagenomics with Carnelian.

Preprocessing. We build a gold standard reference database by combining reviewed prokaryotic proteins with complete Enzyme Commission (EC) labels and evidence of existence from UniProtKB/Swiss-Prot with curated prokaryotic catalytic residues with complete EC labels from the Catalytic Site Atlas. Carnelian first represents gold standard proteins in a compact feature space using low-density, even-coverage locality-sensitive Opal-Gallager hashing. Then, it trains a set of one-against-all (OAA) classifiers (implemented using the Vowpal Wabbit framework) using the compact feature representation of those proteins as well as negative samples based on randomly shuffled sequences generated by HMMER.

Functional profiling. To functionally profile reads from a whole metagenomic sequencing (WMS) experiment, Carnelian first performs probabilistic ORF prediction using FragGeneScan. Next, the ORFs are represented in a compact feature space using the same Opal-Gallager hashing technique. The trained OAA classifier ensemble is then used to classify the ORFs into appropriate EC bins. Abundance estimates of ECs are computed from the raw ORF counts in the EC bins by normalizing against effective protein length per EC bin and a per million scaling factor. Pathway profiles (Orange) are computed by grouping the ECs into metabolic pathways and summing the abundance estimates.

Comparative metagenomics. We start from pathway profiles (Orange) of different populations and conditions. (Blue) Functional relatedness of healthy microbiomes across different populations is assessed by co-abundance pathway analysis. Pathway co-abundance estimates are quantified by Kendall’s rank correlation. Co-abundance clusters are determined by Ward-linkage hierarchical clustering, and the PERMANOVA test is used to determine if the centroids of those clusters differ between populations A and B. (Green) Functional trends analysis across different case-control cohorts of a disease is performed using differential abundance analysis by Wilcoxon rank-sum test and shared significance analysis by Fisher’s combined probability test.
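The abundance-estimation step described in the Fig. 1 caption can be illustrated with a minimal sketch. The caption names the ingredients (raw ORF counts per EC bin, effective protein length per bin, a per-million scaling factor, summation into pathways) but not the exact formula, so the TPM-style normalization below is an assumption. The function names, the toy counts and lengths, and the EC-to-pathway mapping are all hypothetical.

```python
from collections import defaultdict


def ec_abundances(orf_counts, effective_lengths, scale=1_000_000):
    """Length-normalize raw ORF counts per EC bin, then rescale so they sum to `scale`.

    Assumption: a TPM-style formula; the source only names the ingredients.
    """
    rates = {ec: orf_counts[ec] / effective_lengths[ec] for ec in orf_counts}
    total = sum(rates.values())
    if total == 0:
        return {ec: 0.0 for ec in rates}
    return {ec: rate / total * scale for ec, rate in rates.items()}


def pathway_profile(abundances, ec_to_pathways):
    """Sum EC abundances into every pathway the EC is assigned to."""
    profile = defaultdict(float)
    for ec, value in abundances.items():
        for pathway in ec_to_pathways.get(ec, ()):
            profile[pathway] += value
    return dict(profile)


# Toy, made-up inputs for illustration only.
counts = {"1.1.1.1": 300, "2.7.1.1": 100}          # raw ORF counts per EC bin
lengths = {"1.1.1.1": 300.0, "2.7.1.1": 200.0}     # effective protein length per EC bin
mapping = {"1.1.1.1": ["00010"], "2.7.1.1": ["00010", "00051"]}

abund = ec_abundances(counts, lengths)
profile = pathway_profile(abund, mapping)
```

Because an EC can belong to several pathways, pathway totals in this sketch need not sum to the per-million scale even though the EC abundances do.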
Shared functional dysbiosis between two type 2 diabetes (T2D) cohorts and two Crohn’s disease (CD) cohorts
| ID | Pathway | Carnelian | mi-faser | HUMAnN2 | Kraken2 | Fisher’s combined p value |
|---|---|---|---|---|---|---|
| (a) Common pathways between Chinese and European T2D cohorts | | | | | | |
| 00030 | Pentose phosphate pathway | SB | NB | NB | NB | 6.59E-03 |
| 00040 | Pentose and glucuronate interconversions | SB | NB | NB | NB | 9.88E-03 |
| 00051 | Fructose and mannose metabolism | SB | SE | NB | NB | 4.94E-04 |
| 00052 | Galactose metabolism | SB | NB | NB | NB | 4.71E-03 |
| 00061 | Fatty acid biosynthesis | SB | SC | NB | SC | 6.56E-03 |
| 00190 | Oxidative phosphorylation | SB | SE | SC | SE | 4.97E-04 |
| 00250 | Alanine, aspartate, and glutamate metabolism | SB | NB | NB | NB | 1.48E-04 |
| 00290 | Valine, leucine, and isoleucine biosynthesis | SB | SE | NB | NB | 1.68E-05 |
| 00590 | Arachidonic acid metabolism | SB | NB | NB | NB | 2.11E-03 |
| 00600 | Sphingolipid metabolism | SB | SE | NB | SC | 8.86E-05 |
| 00730 | Thiamine metabolism | SB | NB | NB | NB | 2.62E-03 |
| 00983 | Drug metabolism - other enzymes | SB | NB | NB | NB | 2.62E-03 |
| 00195 | Photosynthesis | SB | SB | SC | SB | 2.74E-03 |
| 00254 | Aflatoxin biosynthesis | SC | SC | NB | SB | 1.03E-02 |
| (b) Common pathways between US and Swedish CD cohorts | | | | | | |
| 00500 | Starch and sucrose metabolism | SB | NB | SS | SS | 4.91E-06 |
| 00620 | Pyruvate metabolism | SB | NB | NB | SS | 4.05E-04 |
| 00640 | Propanoate metabolism | SB | NB | NB | NB | 9.04E-03 |
| 00290 | Valine, leucine, and isoleucine biosynthesis | SB | SS | NB | SS | 5.03E-03 |
| 00450 | Selenocompound metabolism | SB | NB | NB | NB | 8.95E-03 |
| 00460 | Cyanoamino acid metabolism | SB | NB | SS | SS | 8.33E-05 |
| 00513 | Various types of N-glycan biosynthesis | SB | NB | NB | NB | 5.79E-03 |
| 00710 | Carbon fixation in photosynthetic organisms | SB | NB | NB | SS | 1.09E-05 |
| 00410 | Beta-alanine metabolism | NB | SS | NB | SB | 5.79E-01 |
(a) Common pathways between the Chinese and European T2D cohorts that have significantly altered read abundances. We found 13 shared pathways, of which 12 are highly relevant to T2D; these pathways are significant in the individual cohorts (BH-corrected Wilcoxon rank-sum test p value <0.05) as well as in Fisher’s combined test at a p value cutoff of <0.05. By contrast, mi-faser finds only the photosynthesis pathway, and Kraken2 finds the photosynthesis and aflatoxin biosynthesis pathways, to be disrupted in both cohorts; with HUMAnN2 profiles, no overlap at the pathway level was found (Additional file 2: Tables S11–S16).

(b) Common pathways between the US and Swedish CD cohorts that have significantly altered read abundances. We identify shared dysbiosis in 8 pathways between the two study cohorts; these pathways are significant in the individual cohorts as well as in Fisher’s combined test at a p value cutoff of <0.05. By contrast, only Kraken2 finds the beta-alanine metabolism pathway to be disrupted in both cohorts; with mi-faser and HUMAnN2 profiles, no overlap at the pathway level was found (Additional file 3: Tables S23, S24, S27, S28, S31, and S32).

SB: significant in both studies; NB: detected but not significant in either study; SC: significant in the Chinese cohort only; SE: significant in the European cohort only; SU: significant in the US cohort only; SS: significant in the Swedish cohort only.
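The shared-significance criterion in the table note relies on Fisher’s combined probability test, which can be sketched as follows. This is a minimal, standard-library illustration rather than the pipeline's own code. The function name is hypothetical, and whether raw or BH-corrected per-cohort p values are fed into the combination is not stated in the text, so the inputs here are simply "per-cohort p values".

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) follows chi-square with 2k df under H0."""
    k = len(p_values)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("need at least one p value in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form chi-square survival function for even df = 2k:
    #   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Two cohorts, each with a hypothetical per-pathway p value of 0.05.
combined = fisher_combined_p([0.05, 0.05])
```

For two cohorts the result reduces to p1·p2·(1 − ln(p1·p2)), so two borderline p values of 0.05 combine to roughly 0.017. A pathway would count as shared under the note's criterion only if it also passes the per-cohort cutoff in each study.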
Fig. 2 Classification of patients vs controls using Enzyme Commission (EC) markers (N-fold cross-validation experiments). a T2D vs controls in the T2D-Qin dataset (Chinese cohort). b T2D vs normal glucose tolerance (NGT) individuals in the T2D-Karlsson dataset (European cohort). c CD patients vs controls in the CD-HMP dataset (individuals from the US). d CD patients vs healthy individuals in the CD-Swedish dataset (Swedish twin studies). e PD vs controls in the PD-Bedarf dataset. In each trial, one of the N subsets was selected as the test set and the remaining N−1 subsets were used as the training set. Differentially abundant ECs were selected from the training set and used as input features for a set of random forest classifiers. Classification performance was measured on the test set. Carnelian-identified EC terms achieve a larger average area under the curve (AUC) in all cases than those identified by other methods.
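The cross-validation protocol in the Fig. 2 caption can be sketched in a few lines. The point illustrated is that differentially abundant ECs are selected from the training subsets only, so the held-out subset never influences feature selection. Everything named here is an assumption for illustration: `select_features` and `fit_and_score` are hypothetical caller-supplied callables standing in for the differential abundance test and the random forest plus AUC computation, and the simple shuffled split is not necessarily how the original folds were drawn.

```python
import random


def n_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint subsets."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::n_folds] for i in range(n_folds)]


def cross_validate(profiles, labels, n_folds, select_features, fit_and_score, seed=0):
    """Run N trials; each holds out one subset and trains on the other N-1.

    select_features(train_profiles, train_labels) -> selected EC identifiers
    fit_and_score(train_x, train_y, test_x, test_y, features) -> score (e.g. AUC)
    Both callables are hypothetical stand-ins supplied by the caller.
    """
    scores = []
    for held_out in n_fold_indices(len(profiles), n_folds, seed):
        held = set(held_out)
        train = [i for i in range(len(profiles)) if i not in held]
        train_x = [profiles[i] for i in train]
        train_y = [labels[i] for i in train]
        test_x = [profiles[i] for i in held_out]
        test_y = [labels[i] for i in held_out]
        # Feature selection sees only the training subsets, never the test subset.
        features = select_features(train_x, train_y)
        scores.append(fit_and_score(train_x, train_y, test_x, test_y, features))
    return scores
```

Averaging the returned scores gives the per-method mean that the caption compares across profiling tools.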
Fig. 3 Functional diversity and relatedness between industrialized and non-industrialized communities. a Heatmap showing the z-scores of read abundances of the ECs with high weights in the top principal components. Standard Ward-linkage hierarchical clustering of the EC profiles of industrialized and non-industrialized microbiomes was performed using Pearson correlation. The two top-level clusters found by hierarchical clustering perfectly capture the separation of non-industrialized and industrialized microbiomes. For display purposes, we show only individuals with read abundances falling outside one standard deviation of the mean in at least nine of the highly variable ECs. See Additional file 4: Figure S6 for the corresponding heatmap and clustering on all individuals. b Heatmap showing co-abundance association across core metabolic pathways. Co-abundance associations between pathways were calculated as the pairwise Kendall rank correlations between the pathway abundance profiles (obtained using Carnelian-generated EC profiles) of microbiomes from both communities considered together. Ward-linkage hierarchical clustering was used to partition the pathways using Euclidean distance, generating either 2, 3, 4, or 5 clusters. Although hierarchical clustering can be used to identify clusters of co-abundant pathways across the non-industrialized and industrialized communities, the clusters were not significantly different from each other with respect to the industrialized/non-industrialized label (PERMANOVA test p values > 0.05). Thus, in contrast to the top-level EC label clustering from a, the partitions are not simply recapitulating the industrialized/non-industrialized labels.
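The pairwise co-abundance computation described in panel b can be sketched as below. The caption specifies Kendall rank correlation but not the variant, so the tie-aware tau-b used here is an assumption, as are the function names and the toy pathway profiles. The subsequent Ward-linkage clustering and PERMANOVA test are not reproduced.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length abundance vectors (O(n^2) pairs)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            ties_x += 1          # pair tied in x (joint ties included)
        if dy == 0:
            ties_y += 1          # pair tied in y (joint ties included)
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def coabundance_matrix(pathway_profiles):
    """Pairwise tau-b between pathways; each profile is one value per microbiome."""
    names = sorted(pathway_profiles)
    matrix = [[kendall_tau_b(pathway_profiles[a], pathway_profiles[b]) for b in names]
              for a in names]
    return names, matrix


# Toy, made-up abundance profiles across four microbiomes.
toy = {"00010": [1.0, 2.0, 3.0, 4.0], "00051": [4.0, 3.0, 2.0, 1.0]}
names, matrix = coabundance_matrix(toy)
```

Each row of the resulting matrix is the correlation profile of one pathway against all others, which is the kind of vector a Euclidean-distance clustering step would then partition.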