Literature DB >> 34614189

metaGEM: reconstruction of genome scale metabolic models directly from metagenomes.

Francisco Zorrilla^1,2,3, Filip Buric¹, Kiran R Patil^2,3, Aleksej Zelezniak^1,4.

Abstract

Metagenomic analyses of microbial communities have revealed a large degree of interspecies and intraspecies genetic diversity through the reconstruction of metagenome assembled genomes (MAGs). Yet, metabolic modeling efforts mainly rely on reference genomes as the starting point for reconstruction and simulation of genome scale metabolic models (GEMs), neglecting the immense intra- and inter-species diversity present in microbial communities. Here, we present metaGEM (https://github.com/franciscozorrilla/metaGEM), an end-to-end pipeline enabling metabolic modeling of multi-species communities directly from metagenomes. The pipeline automates all steps from the extraction of context-specific prokaryotic GEMs from MAGs to community level flux balance analysis (FBA) simulations. To demonstrate the capabilities of metaGEM, we analyzed 483 samples spanning lab culture, human gut, plant-associated, soil, and ocean metagenomes, reconstructing over 14,000 GEMs. We show that GEMs reconstructed from metagenomes have fully represented metabolism comparable to isolated genomes. We demonstrate that metagenomic GEMs capture intraspecies metabolic diversity and identify potential differences in the progression of type 2 diabetes at the level of gut bacterial metabolic exchanges. Overall, metaGEM enables FBA-ready metabolic model reconstruction directly from metagenomes, provides a resource of metabolic models, and showcases community-level modeling of microbiomes associated with disease conditions allowing generation of mechanistic hypotheses.

Entities: Chemical

Mesh：

Year: 2021 PMID： 34614189 PMCID： PMC8643649 DOI： 10.1093/nar/gkab815

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Whole metagenome shotgun sequencing and genome-resolved metagenomics allow for the exploration of personalized, context-specific microbial communities at a species or strain level resolution (1). Changes of microbiota composition are strongly linked to a range of diseases including cancer, behavioral, neurological, and metabolic disorders (2–7). However, a mechanistic understanding of the role of the human gut microbiome in disease, especially the roles of specific strains and the associated metabolic factors, remains challenging due to the vast intra- and inter-species diversity. To this end, short-read sequencing data allows for the extraction of metagenome assembled genomes (MAGs) directly from raw sequencing data, while avoiding culture-based methods or making use of the limited number of reference genomes, enabling the discovery of unknown species and the exploration of personalized microbiomes (8–10). Indeed, a number of attempts aiming to explore the human gut microbiome composition and diversity have generated hundreds of MAGs representing previously unknown or uncultured species, as well as thousands of known species MAGs (11–13). Pangenome analysis of the human gut microbiome demonstrated that the functional repertoire of gut species differ significantly, with a median core genome proportion of only 66% (14), revealing differences in metabolic potentials of individual microbiomes. Attempts of mechanistic links between diet or inter-species interactions with microbiota composition and dynamics led to the development of gut species genome-scale metabolic models (15). GEnome-scale Metabolic models (GEMs) allow assessing species nutritional requirements (16) and their interactions in the human gut as well as in diverse environmental communities (17–20). The current paradigm of metabolic modeling typically relies on mapping identified taxa to their closest reference genomes. This limits analysis and its interpretation to the metabolic networks represented in the known reference genome space. This can cause false positives (i.e. pathways present in the reference but missing from the variant present in the community) as well as false negatives (i.e. pathways missing in the reference genomes but present in the community variant), ultimately leading to inaccurate predictions of individual species metabolism as well as that of cross-feeding interactions. Thus the current modelling attempts are likely failing to capture the specific metabolic features of a given species across different contexts, e.g. microbiota of individuals with different disease conditions. Towards overcoming this limitation, here we present metaGEM, the computational pipeline that enables reconstruction of sample specific metabolic models directly from short read metagenomics data. Instead of relying on reference genomes, metaGEM generates high quality metagenome assembled genomes, which are then used to reconstruct context-specific prokaryotic GEMs using state-of-the-art methodologies (Figure 1). Our contributions are 2-fold: (i) an end-to-end framework enabling community-level metabolic interaction simulations and (ii) a resource of >14 000 MAGs from a range of metagenomic biomes, including 3750 high quality MAGs, with corresponding FBA-ready GEMs from human gut microbiome studies (21,22) and global microbiome projects (23–25). The metaGEM pipeline is implemented in Snakemake (26), an open-source, community driven, and scalable bioinformatics workflow engine, supporting major popular high-performance-computing cluster environments as well as standalone systems (Supplementary Figure S1).

Figure 1.

Schematic of the metaGEM pipeline workflow highlighting tools, inputs, and outputs. Short reads are quality filtered and adapter trimmed using fastp (27). Quality controlled reads are assembled individually using MEGAHIT (28). Using either kallisto (31) or bwa (29), quality controlled reads are mapped to each assembly to obtain contig coverage information across samples. Coverage information and assemblies are used by CONCOCT (8), MetaBAT2 (10) and MaxBin2 (9) to generate three bin sets for each sample. The metaWRAP (32) bin_refinement module is used to dereplicate bin sets for each sample and find the highest quality version of each bin. The metaWRAP bin_reassemble module is used to extract quality controlled short reads mapping from the focal sample to the bin, which are used to generate two single genome assemblies using strict and permissive parameters. The original and reassembled versions are compared for quality, and the best version is kept. Refined and reassembled MAGs are used to generate GEMs using CarveMe (33). These models can be quality checked using MEMOTE (34). Community simulations are carried out for each sample using SMETANA (17). Other features include: taxonomic classification using mOTUs2 (36) and/or GTDB-Tk (35), a custom mapping-based abundance estimation module that does not make use of marker-gene or reference-genome based approaches, growth rate estimation for high coverage MAGs using GRiD (37), and pangenome analysis using prokka (38) and roary (39). EukRep (40) can be used to scan for eukaryotic contigs in the CONCOCT bin sets, which can then be processed by EukCC (41) to provide completeness, contamination, and taxonomic assignments for eukaryotic MAGs.

MATERIALS AND METHODS

Snakemake and HPCCs

metaGEM was implemented in Snakemake v5.10.0, and makes use of the bioinformatic tools listed in Table 1. The workflow was designed to analyze datasets independently from each other, while samples from the same datasets are processed in parallel. All Snakemake jobs were run on the Chalmers Center for Computational Science and Engineering (C3SE) and the European Molecular Biology Laboratory (EMBL) Heidelberg high performance computer clusters (HPCC).

Table 1.

List of tools used by metaGEM

Tool	Task	Repository
Snakemake v5.10.0 (26)	Workflow management	https://github.com/snakemake/snakemake
fastp v0.20.0 (27)	Short read QC filtering and adapter removal	https://github.com/OpenGene/fastp
MEGAHIT v1.2.9 (28)	Short read assembly	https://github.com/voutcn/megahit
bwa v0.7.17 (29)	Contig coverage	https://github.com/lh3/bwa
SAMtools v1.9 (30)	Contig coverage	https://github.com/samtools/samtools
kallisto v0.46.1 (31)	Contig coverage	https://github.com/pachterlab/kallisto
CONCOCT v1.1.0 (8)	Contig binning	https://github.com/BinPro/CONCOCT
MetaBAT2 v2.12.1 (10)	Contig binning	https://bitbucket.org/berkeleylab/metabat/src/master/
MaxBin2 v2.2.5 (9)	Contig binning	https://sourceforge.net/projects/maxbin2/
metaWRAP v1.2.3 (32)	Bin refinement and reassembly	https://github.com/bxlab/metaWRAP
CarveMe v1.2.2 (33)	GEM reconstruction	https://github.com/cdanielmachado/carveme
SMETANA v1.2.0 (17)	Community GEM simulation	https://github.com/cdanielmachado/smetana
MEMOTE v0.9.13 (34)	GEM quality report	https://github.com/opencobra/memote
GTDB-Tk v1.1.0 (35)	MAG taxonomy assignment	https://github.com/Ecogenomics/GTDBTk
mOTUs2 v2.5.1 (36)	MAG taxonomy assignment	https://github.com/motu-tool/mOTUs_v2, https://github.com/AlessioMilanese/classify-genomes
GRiD v1.3 (37)	MAG growth rate estimation	https://github.com/ohlab/GRiD
Prokka v1.14.6 (38)	MAG functional annotation	https://github.com/tseemann/prokka
Roary v3.13.0 (39)	Pangenome analysis	https://github.com/sanger-pathogens/Roary
EukRep v0.6.6 (40)	Identify eukaryotic MAGs	https://github.com/patrickwest/EukRep
EukCC v0.1.4.3 (41)	Eukaryotic MAG taxonomy and quality	https://github.com/Finn-Lab/EukCC

Metagenomic samples and short read quality control

A total of 483 whole metagenome shotgun samples from five metagenomic studies (21–25) were downloaded from the NCBI SRA or EBI ENA to the HPCC. The lab culture dataset was the only single end read set, while the remaining datasets consisted of paired end read sets. The fastp tool v0.20.0 was used for short read quality filtering and adapter removal using default settings.

Short read assembly

Single sample assemblies were obtained using MEGAHIT v1.2.9. The flag ‘–presets meta-sensitive’ was used on all assemblies, which is equivalent to setting ‘–min-count 1’ ‘–k-list 21,29,39,49,59,69,79,89,99,109,119,129,141’. The parameter ‘–min-contig-len’ was set to 1000 for the TARA Oceans dataset.

Contig coverage estimation and binning

For the lab culture, gut microbiome, plant associated, and soil datasets, bwa-mem v0.7.17 was used with default settings to cross map each quality controlled set of short reads to each generated assembly within a dataset. An index was created for each assembly prior to mapping using the ‘bwa index’ command with default settings. Each mapping operation resulted in a SAM file, which was converted to BAM format using the ‘samtools view’ command with the flags ‘-Sb’, and then sorted using the ‘samtools sort’ command with default settings. The sorted BAM files were then used to create the contig coverage across samples input files for MetaBAT2 v.2.12.1 and MaxBin2 v2.2.5 with the MetaBAT2 script ‘jgi_summarize_bam_contig_depths’ (default parameters). The CONCOCT script ‘cut_up_fasta.py’ with parameters ‘-c 10000 -o 0 -m -b’ was used to generate a BEDfile for each assembly, with contigs cut into 10 kb chunks. The generated BEDfile and sorted BAM files were used by the CONCOCT script ‘concoct_coverage_table.py’ (default settings) to generate the contig coverage across samples input files for CONCOCT. CONCOCT v1.1.0 was run for each cut up assembly, using the contig coverage across samples as input and with the ‘-c 800’ parameter. The CONCOCT script ‘extract_fasta_bins.py’ script was then run with default settings to extract bins in terms of the original uncut contigs. MetaBAT2 v.2.12.1 was run using the ‘metabat2’ command for each assembly, using the contig coverage across samples as input and with default settings. MaxBin2 v2.2.5 was run using the ‘run_MaxBin.pl’ for each assembly, using the contig coverage across samples as input and with default settings. For the TARA oceans dataset, the kallisto v0.46.1 tool was used for mapping quality controlled paired end reads to assemblies. First, each assembly was cut up into 10 kb chunks using the CONCOCT ‘cut_up_fasta.py’ script with parameters ‘-c 10000 -o 0 -m’. The cut up assembly was then used to generate a kallisto index using the ‘kallisto index’ command with default settings. Next, the ‘kallisto quant’ command was used with the ‘–plaintext’ setting to cross map each sample set of quality controlled paired end reads to each cut up assembly. Finally, the ‘kallisto2concoct.py’ script was used to summarize the mapping results across samples for each set of assembled contigs. For the TARA oceans dataset the binners were used as described above, but only CONCOCT used contig coverage across samples as input, while MetaBAT2 and MaxBin2 only used contig coverage from the assembled sample as input.

Bin refinement and reassembly

To reconcile and dereplicate the three generated binner outputs, the metaWRAP metaWRAP v1.2.3 ‘bin_refinement’ command was used with parameters ‘-x 10 -c 50’, corresponding to a maximum bin contamination threshold of 10% and a minimum bin completeness threshold of 50% (i.e. medium quality bin criteria) based on CheckM estimates. Note that CheckM makes use of a database of reference genomes for the calculation of completeness and contamination scores. The metaWRAP ‘reassemble_bins’ command was used with parameters ‘-x 10 -c 50’ to improve bin quality whenever possible by independently reassembling reads that map from the parent sample's quality controlled short reads to the refined bin. Refined and reassembled bins are hereafter referred to as metagenome assembled genomes (MAGs), although the terms may be used interchangeably.

MAG abundance quantification and taxonomic assignment

Absolute and relative abundances of MAGs were calculated using bwa v0.7.17 and SAMtools v1.9. First, all MAGs generated from the same sample were concatenated into a single fasta file, based on which an index was created using ‘bwa index’ (default parameters). Next, ‘bwa-mem’ (default parameters) was used to map the quality controlled short reads from the parent sample to the concatenation of generated MAGs. The resulting SAM file was converted to BAM format and sorted using ‘samtools view’ and ‘samtools sort’, respectively, using the same parameters as described above. Next, ‘samtools flagstat’ was used (default parameters) to extract the number of reads that mapped from the quality controlled paired end reads to the concatenation of all MAGs generated in that sample. Next, for each MAG, an index was created using ‘bwa index’ (default parameters), the quality controlled short reads were mapped to the MAG using ‘bwa-mem’ (default parameters), the resulting SAM file was converted to BAM format and sorted as described above, and the number of reads mapping to the MAG from the short reads was extracted using ‘samtools flagstat’. The abundance of a given MAG was estimated by dividing the number of quality controlled reads that map to the MAG by the number of quality controlled reads that map to the concatenation of all MAGs from that sample, divided by the megabase pair length of the MAG. For each sample, these non-normalized abundances were summed to obtain a sample specific normalizing factor. To obtain normalized relative abundances, each non-normalized abundance was divided by the sample specific normalization factor. The mOTUs2 v2.5.1 tool was also used to calculate abundances, for comparison with the above mapping based method, only in the lab culture dataset. The ‘motus profile’ command was used with default settings. GTDB-Tk v1.1.0 was used to assign taxonomic labels to the generated MAGs using the ‘gtdbtk classify_wf’ with default settings. Both mOTUs2 and GTDB-Tk make use of reference genomes and/or marker genes for taxonomic classification.

Genome scale metabolic model reconstruction and quality reports

CarveMe v1.2.2 was used to automatically reconstruct genome scale metabolic models from ORF annotated protein fasta files derived from MAGs using the default CPLEX solver v12.8. The ‘carve’ command was run using the ‘–fbc2 -g’ flags to gapfill models on complete media and generate FBC2 format models. The MEMOTE v0.9.1 tool was then used to generate quality reports for each genome scale metabolic model. The ‘memote run’ command was used with the flags ‘–skip test_find_metabolites_produced_with_closed_bounds –skip test_find_metabolites_consumed_with_closed_bounds –skip test_find_metabolites_not_produced_with_open_bounds –skip test_find_metabolites_not_consumed_with_open_bounds –skip test_find_incorrect_thermodynamic_reversibility’ to avoid running time consuming tests.

Community simulations

The SMETANA v1.2.0 tool was used for simulating gut microbiome communities of reconstructed genome scale metabolic models. The ‘smetana’ command was used with the flags ‘–flavor fbc2 –detailed –mediadb media_db.tsv -m M11’ and using the default CPLEX solver v12.8. The simulation media was the same as was used for gapfilling (full media, M3) minus aromatic amino acids (M11). The media file was obtained from the authors of previous publication (16), and can be accessed from the metaGEM GitHub repository (https://github.com/franciscozorrilla/metaGEM). Regarding the implementation specifics, SMETANA is formulated as a mixed linear integer problem (MILP) that enumerates all possible essential metabolic exchanges within a community of N species that sustain a non-zero growth of all N species subject to a mass balance constraint. SMETANA goes beyond pairwise comparison and does constraint based analysis for all members in the microbial community simultaneously. SMETANA does use any biological objective functions. Further details about implementation and specific algorithms used within the SMETANA framework are available in the original publication (17).

Additional features in the metaGEM pipeline

Although not discussed in detail, there are several additional features that were incorporated into metaGEM that may be useful to users. Growth rates for medium and high coverage MAGs can be estimated using the GRiD v1.3 tool. Prokka v1.14.6 can be used to functionally annotate MAGs and the output can be provided to Roary v3.13.0 in order to visualize the core and pangenome of a set of MAGs. Communities with suspected eukaryotic MAGs can be further probed by scanning for eukaryotic contigs in the CONCOCT bin set using EukRep v0.6.6. Identified eukaryotic bins can then be analyzed by EukCC v0.1.4.3, to obtain completeness and contamination estimates as well taxonomic lineage estimates.

RESULTS

Implementation and features of metaGEM pipeline

The metaGEM pipeline starts from quality filtering and single sample assembly of short read FASTQ data, and proceeds to the reconstruction of metagenome assembled genomes using three different binners: CONCOCT (8), MetaBAT2 (10), MaxBin2 (9). The three output draft bin sets are then refined (i.e. de-replicated) and reassembled using metaWRAP to obtain the highest quality version of each bin (32) (Supplementary Figure S2). Note that metaWRAP completeness and contamination estimates are based on a marker-gene approach used by CheckM (42). These final prokaryotic MAGs are used to automatically generate FBA-ready genome-scale metabolic reconstructions using CarveMe (33), which are quality checked using the MEMOTE (34) framework. By integrating the Species METabolic ANAlysis (SMETANA) framework (17) (https://github.com/cdanielmachado/smetana), the generated GEMs can be used for sample specific community-level metabolic interaction modeling. Other pipeline features include the automatic assignment of taxonomic classification to the reconstructed MAGs and GEMs, the calculation of relative and absolute abundance for generated MAGs, the estimation of growth rate (37) for high coverage MAGs, and pangenome analysis (39) (Figure 1). Although there is a growing number of MAG reconstruction pipelines (43–49), a simple comparison of studies that used differing methods for MAG generation revealed that metaGEM consistently recovers more high quality genomes per sample, both from gut microbiome (11–13) and ocean (50–52) metagenomes (Supplementary Figure S3). Indeed, metaGEM has already been applied in one of our recent studies to interrogate the plastic-degrading potential of the global microbiome (53). Regarding tool selection for specific tasks such as short read quality filtering, assembly, binning, etc., there are a number of published stand-alone candidate tools to choose from in the literature. Specific tools were chosen based on a combination of factors including: user-friendliness, computational efficiency, performance (i.e. quality of output), publication date, quality of documentation, development activity, technical support availability, and size of user community. The flexibility of the Snakemake framework allows easy expansion of the metaGEM toolset to add new features or alternative tools. To demonstrate the versatility of metaGEM pipeline, we reconstructed genome-scale metabolic models (GEMs) from five metagenome datasets spanning different biological and technical complexity, namely: gut microbiomes (21,22), plant associated microbiomes (23), global ocean microbiomes (24), and bulk soil microbiomes (25). In total, we reconstructed 3750 high quality metagenomic assemblies (MAGs) with >90% completeness & <5% contamination and 10 349 medium quality MAGs with >50% completeness and <10% contamination (Supplementary Figure S4). We assessed the quality of MAG reconstruction by recovering genomes from a controlled multispecies lab culture experiment (21). Briefly, 7 human gut microbiome species were grown in vitro across four biological replicates with 12 time points totaling 48 metagenomic samples. A total of 154 MQ MAGs, of which 137 also meet the HQ MAG standard, were reconstructed with an average completeness of 95.4% and average contamination of 0.3%, with on average ∼3.2 MAGs per sample reflecting the growth curves shown in the original publication, which are initially dominated by very few species and ultimately just one (Supplementary Figure S5). Abundance estimates generated using a mapping-based approach (Materials and Methods) were perfectly correlated to marker gene based abundance estimates (Pearson's r = 0.99, P-value < 1e−16) well recapitulating experimental observations (Figure 2A).

Figure 2.

Abundance, quality, and diversity comparisons of reconstructions. (A) Abundance estimates generated by metaGEM using a mapping based approach compared to marker gene based approach of mOTUs2 in small lab culture communities dataset. (B) Distribution of genes, reactions and metabolites in genome scale metabolic models across AGORA (15), BiGG (54), EMBL GEMs (33), KBase (49), medium quality (MQ) metaGEM and high quality (HQ) metaGEM sets. (C) Distribution of metabolic distances between a set of 200 randomly chosen reference EMBL GEMs compared to a randomly chosen set of 800 EMBL GEMs and 800 randomly chosen metaGEMs from the gut microbiome dataset. (D) Cumulative core and pan genome curves for the top 10 most commonly reconstructed gut microbiome species based on EC numbers present in the reconstructed genome scale metabolic models. (E) Comparison of EC numbers between 165 species reconstructed from the gut microbiome dataset and also found in the AGORA collection. Inset venn diagram shows average value of EC numbers unique to the compared sets as well as their average intersect.

High-quality metabolic reconstructions directly from metagenomes

We next reconstructed a total of 14 087 GEMs from the metagenome assembled genomes (MAGs) (Materials and Methods) and compared them to highly curated reference genome-based BiGG models (54), AGORA (15), EMBL (33) and KBase models (49). In terms of the number of metabolic reactions and unique metabolites, the metagenomic GEMs show similar distributions compared to reference-based GEM reconstructions (Figure 2B). Specifically, GEMs derived from high quality MAGs had average percent differences of 1.6%, 0.3% and 10.0% in number of metabolites and 3.7%, 21.1%, 44.3% in number of reactions respectively compared to reference genome-based EMBL, AGORA and KBase models, and had average percent differences of 4.1% less metabolites and 40.1% less in reactions compared to manually curated BiGG models, which overall display a higher number of reactions and metabolites. In terms of the number of genes present in the models, GEMs derived from high quality MAGs had between 9.5% to 19.9% less genes as compared to the reference genome-based collections, with the exception of BIGG models that involved over 40% more gene annotations. Furthermore, metaGEMs reconstructed using high quality MAGs had average percent differences of 24.6%, 12.8% and 10.3% compared to medium quality MAG-based metaGEMs in terms of their number of genes, reactions, and metabolites, suggesting that the metabolic reconstruction process (33) is robust towards bin completion. Separating metagenomic models by dataset shows that the distributions of metabolites, reactions, and genes are similar respectively across datasets as well (Supplementary Figure S6). We also evaluated whether the metabolic reaction diversity identified in metagenomes would be comparable to the expected enzymatic diversity from the reference genome reconstructions, i.e. if expected reactions and pathways would be present in metagenomes based GEMs. For this we randomly sampled 1000 metabolic models from each, reference-genome-based and metagenome-based reconstructions, and computed metabolic similarity (expressed as Jaccard index) to a random 20% held-out set of reference-genome-based reconstructions (Figure 2C). Metagenomic GEMs and reference genome based EMBL GEMs had an average 4.2% difference (Wilcoxon rank sum test P-value < 2.2e−16) in metabolic reactions compared to an independent subset of metabolic reconstructions from reference based genomes, suggesting that metabolic models derived from metagenome assembled genomes capture expected metabolic diversity and features as reconstructions from reference-based genomes (Materials and Methods). Performing PCA on the presence/absence of EC numbers across models resulted in AGORA and gut metagenomic models clustering near each other, while EMBL and ocean metaGEM models clustered closer to each other (Supplementary Figure S7).

Pan-metabolism of metagenomic GEMs uncovers metabolic diversity within species

Recent gut metagenomic microbial population studies suggest that strain-level diversity within the same species can differ by over 20% of genome content between individuals (55). With more extreme examples from environmental isolates of soil myxobacteria with only 30% conserved core part of genomes showing extreme gene diversity within the same species (56). These analyses imply that current reference based approaches (15) may not always reflect the metabolic diversity found across a species’ pangenome or pan-metabolism. To investigate whether reconstructed metagenomic GEMs are able to capture this intra-species metabolic diversity we compared intra-species EC numbers (i.e. the pan-metabolism) diversity across the top ten most prevalent species (Figure 2C). Pan-metabolism analysis confirmed that no two models were exactly the same with respect to their EC number content. Indeed, the core genome of metabolism in the analyzed species ranged between 38.5–57.6% of their respective pangenomes, in line with previously reported degree of intraspecies genetic variation in prokaryotes (14,55). We also analyzed the pangenome of 141 taxonomically undefined species, where we found the core genome to be 6.9% of the pan-species pan-genome (Supplementary Figure S8). To exemplify further, we compared unique EC numbers between metagenome based GEMs and reference genome based gut AGORA models (15) for a total of 165 matched gut microbiome species. The intersect of AGORA and metaGEM EC numbers range between 48.9% to 69% for all matched species, with metaGEM models containing more EC numbers than their corresponding AGORA model in 53.9% of cases, while AGORA models contained more EC numbers 46.1% of comparisons. Inspection of the KEGG pathway annotations of the EC numbers present exclusively in AGORA or metagenomic gut models reveals that the majority of these enzymes are associated with biosynthesis of secondary metabolites and antibiotics (Supplementary Figure S9). Finally, we note that there is little overlap between identified species based on phylogenetic marker genes and AGORA models, while we observed 10-fold greater overlap for metaGEM models (Supplementary Figure S10).

metaGEMs enables modeling of personalized human gut communities

To investigate potential microbial metabolic interactions in healthy and metabolically impaired type 2 diabetes human gut microbiomes (22), we reconstructed a total of 4127 of personalized human gut metaGEMs across 137 metagenomes that were classified according to the disease condition of participants from the original study, i.e. normal glucose tolerance (NGT, n = 42), impaired glucose tolerance (IGT, n = 42) or type 2 diabetic (T2D, n = 53). We then applied Species Metabolic Coupling Analysis (SMETANA), a constraint-based technique for modeling interspecies dependencies in microbial communities (17), to elucidate the potential microbial metabolic interactions in healthy and metabolically impared patients. Briefly, SMETANA outputs a set of scores for each community, corresponding to measures of strength of cross-feeding interactions that should take place to support the growth of community members in a given condition, i.e. a likelihood of species A growth depending on metabolite X from species B. Specifically, we found 22 compound exchanges that were significantly different (Wilcoxon rank sum test, BH adjusted P-values < 0.01) between the disease groups, representing metabolites from multiple classes including organic acids and lipid-like molecules (Supplementary Figures S11 and S12). Additionally, we identified 27 donors and 27 receivers that had statistically significant distributions of SMETANA scores between disease groups (Wilcoxon rank sum test, BH adjusted P-value < 0.0001), including genera that have been associated with T2D (57) (Supplementary Figures S13 and 14) such as Bifidobacterium, Faecalibacterium, Roseburia, Ruminococcus and Blautia. Visualization of 10 compounds with the lowest P-values exchanged by the eight most frequent donors or receivers of these metabolites shows notable differences in metabolic architecture across conditions (Figure 3A). For example, in the visualized data subset, exchanges of l-malic acid undecaprenyl diphosphate were observed 2.25 and 2.75 times more frequently, and with a 115-fold and 3.9-fold higher average SMETANA score (Wilcoxon rank sum test, BH adjusted P-value = 1.8e−05 and 1.0e−03 respectively) respectively in T2D communities compared to NGT. Although exchanges of nitrite and hydrogen sulfide were observed 1.6 and 1.4 times more frequently in T2D compared to NGT communities, exchanges in the latter had a 5.5-fold and and 5.7-fold higher SMETANA score respectively compared to T2D communities (Wilcoxon rank sum test, BH adjusted P-value = 3.3e−05 and 5.8e−04 respectively). Visualization of average SMETANA scores grouped by broad metabolite class for interactions involving Faecalibacterium prausnitzii C as a receiver (Figure 3B, C) also highlight differences in metabolism, with a 3.7-fold stronger dependency on organic oxygen compounds and a 2.2-fold stronger dependency on nucleosides, nucleotides, and analogues in NGT communities compared to T2D. Also of note is the fact that Faecalibacterium prausnitzii did not demand any inorganic compounds in NGT communities, while overall it had higher dependency on inorganic compounds (nitrite) in IGT communities compared to T2D.

Figure 3.

SMETANA simulations uncover differences in metabolism across conditions. (A) Alluvial diagram showing top 10 compounds exchanged with statistical significance across conditions between eight species, representing 279 interactions (NGT n = 61, IGT n = 58, T2D n = 160) across 41 samples (NGT n = 12, IGT n = 12, T2D n = 17). Thickness of lines are proportional to magnitude of SMETANA score. (B) Radar plot of average SMETANA scores based on 543 interactions (NGT n = 50, IGT n = 161, T2D n = 249) across 18 samples (NGT n = 4, IGT n = 6, T2D n = 8) grouped by metabolite class across conditions for receiver Faecalibacterium prausnitzii C. (C) Network diagrams of interactions involving Faecalibacterium prausnitzii C (centered in each subgraph) as a receiver across conditions. Thickness of lines are proportional to magnitude of SMETANA score.

DISCUSSION

The latest genome sequencing studies suggest that bacterial genomes are highly dynamic (58), exhibiting a high exchange of genetic material between strains of the same species or between different bacterial species. With multiple examples where specific strains play key roles in disease pathogenesis (59) or niche metabolic adaptation (56), it becomes apparent that bacteria should be analyzed by the context specific functional repertoire (including metabolism) of strains and not merely by their species membership (60). The highly diverse pangenome-derived functional repertoire (61,62) of microbial communities, the interpersonal differences in gene content of human gut bacteria (55), and the strain specific plasticity of metabolic adaptation (56) cannot be captured by either amplicon sequencing nor by using reference genomes. In other words, while reference-based GEMs may ensure completeness, they only represent a single point in the pangenome landscape of a species. To address these limitations, and to further enable the interrogation of functional and metabolic diversity existing within microbial communities derived from metagenomes, here we developed metaGEM, a pipeline for reconstruction and metabolic modeling of multispecies microbial communities derived directly from metagenomic samples. In short, the pipeline generates metagenome assembled genomes from metagenomic data which are subsequently used to reconstruct and simulate genome scale metabolic models (GEMs) in their communities. We showed the versatility and usefulness of the metaGEM pipeline by generating metagenome assembled genomes, annotating their taxonomy, calculating relative abundances, reconstructing genome scale models, and simulating metabolic interactions in microbial communities from a range of metagenomic datasets including lab cultures, human gut, plant associated, soil and ocean metagenomes. Notably, in small metagenomic communities from lab cultures nearly 90% of generated MAGs were of high quality (>90% of genome completeness and <5% contamination), and calculated MAG abundance estimates were highly correlated to marker gene-based estimates (Figure 2A). While metagenomes from more complex communities yielded more MAGs, they pose a more challenging assembly and binning scenario, resulting in a lower percentage of high quality MAGs (Supplementary Figure S4). By comparing the generated metabolic models to previously published reference based GEM collections we demonstrated that metaGEMs have a comparable number of reactions and metabolites, despite tending to have fewer genes (Figure 2B). This suggests that some of these reference based reconstructions may contain extraneous metabolic information. Furthermore, by calculating pairwise metabolic distance estimates between models, we show that metaGEMs capture a similar distribution of enzymatic diversity as compared to reference genome based reconstructions (Figure 2C). While the metagenomic models are quality checked for basic functionality using the MEMOTE test suite, passing these tests does not necessarily ensure they can correctly predict fluxes specific to the environment where they were isolated from. We demonstrated that metaGEMs capture a large degree of intraspecies variation by analyzing the core and pangenomes of the top 10 most prevalent species from the gut microbiome dataset. Indeed, no two models from the same species were identical in terms of their metabolism, with up to 60% of metabolic diversity present within species pangenomes, demonstrating remarkable degree of intraspecies metabolic variation captured by metaGEM models (Figure 2D). Furthermore, by comparing metaGEMs with reference-based gut species metabolic AGORA models (15), we showed that reference-based models introduce metabolic reactions that may not necessarily be present in every metagenomic context, while the metaGEM models reconstruct context specific metabolism, entirely based on actual metagenomic data that would have been otherwise missing from the reference-based reconstructions (Figure 2E). Indeed, we find that the most common pathways found exclusively in either metaGEM models and AGORA correspond to the biosynthesis of antibiotics and secondary metabolites, which are known to be context specific and horizontally transferred (63). We showed that gut metagenomes corresponding to different type 2 diabetes disease groups (NGT, IGT, T2D) (22) generate communities with different metabolic architectures. By carrying out species metabolic coupling analysis (17) for each metagenome-derived personalized community of GEMs, we identified 22 growth-related metabolic exchanges, 27 donor species, and 27 receiver species that were significantly different between at least one disease condition comparison (Supplementary Figures S11, S13, S14). Visualization of SMETANA simulation results revealed notable differences in the metabolic idiosyncrasies of species and communities as a function of their disease state condition, as reflected by scores of their metabolic dependencies (Figure 3). Seven of the twenty-two predicted metabolites (D-alanine, anhgm, murein5p5p, murein5px4p, uaagmda, udcpdp and ucdpp) are implicated in peptidoglycan and cell wall biosynthesis. While these predicted metabolite exchanges are required for microbial biomass generation, it should be noted that peptidoglycan (also known as lipopolysaccharide) exchanges in the gut microbiome have been linked to insulin resistance and increased risk of T2D (64,65). Seven predicted carbohydrate family metabolites (4-aminobutanoate, acetaldehyde, l-arabinose, carbon dioxide, fumarate, d-galactose and l-malate) of which the short chain fatty acid 4-aminobutanoate (also known as GABA), l-malate, l-arabinose and d-galactose have all been previously linked with insulin sensitivity and metabolic syndrome (4,64,66–70). Another group of predicted metabolites from the aromatic amino acid family (benzoate, chorismate and l-tryptophan) with tryptophan metabolism has been implicated in insulin resistance (71). Two of the metabolites (nitrite and ornithine) form nitrogen metabolism, in accord with the nitrite being reported to have a positive effect on insulin secretion (72). Finally, while the role of hydrogen sulfide in T2D metabolism remains to be determined, there have been several reported associations of H2S with insulin sensitivity found in the literature (73,74). The majority of the predicted differentially exchanged metabolites are thus corroborated by previous experimental findings, supporting that metaGEM is able to infer phenotype-relevant metabolic networks and metabolite exchanges within communities assembled from metagenomes. By directly modeling community level sample-specific microbiomes here we identified potential interaction differences between pathogenicity of type 2 diabetes that are based purely on metabolic capacity of the gut communities and are not dependent on species abundance estimates, thus providing additional information that would not be otherwise accessed. Moreover, the SMETANA framework is free from arbitrary assumptions of growth optimality, instead, it evaluates all scenarios of interspecies metabolic exchanges that support the growth of member species in a given community providing unbiased community level metabolism analysis (17). The framework has been validated previously (16,17,75,76) by reproducing experimentally determined interactions in well-studied microbial communities (77,78). Nevertheless, the field of metabolic modelling in ecology scale microbial communities is hypothesis-generating, aiming to propose novel insights into microbiome interactions within ecosystems. The predictions need to be supported by experimental approaches and orthogonal analyses, especially for large-scale microbiomes. Overall, our study offers an end-to-end framework to study sample-specific metabolism of complex microbial communities directly from metagenomic data without relying on reference genomes. We therefore envisage that metaGEM will become an important tool for deciphering metabolic interactions in complex microbial communities.

DATA AVAILABILITY

All FBA-ready metagenome-based metabolic reconstructions were deposited to the Zenodo repository and are available at http://doi.org/10.5281/zenodo.4407746. The code of metaGEM can be accessed at https://github.com/franciscozorrilla/metaGEM. Click here for additional data file.

76 in total

1. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets.

Authors: Yu-Wei Wu; Blake A Simmons; Steven W Singer
Journal: Bioinformatics Date: 2015-10-29 Impact factor: 6.937

2. Generation of genome-scale metabolic reconstructions for 773 members of the human gut microbiota.

Authors: Stefanía Magnúsdóttir; Almut Heinken; Laura Kutt; Dmitry A Ravcheev; Eugen Bauer; Alberto Noronha; Kacy Greenhalgh; Christian Jäger; Joanna Baginska; Paul Wilmes; Ronan M T Fleming; Ines Thiele
Journal: Nat Biotechnol Date: 2016-11-28 Impact factor: 54.908

3. Near-optimal probabilistic RNA-seq quantification.

Authors: Nicolas L Bray; Harold Pimentel; Páll Melsted; Lior Pachter
Journal: Nat Biotechnol Date: 2016-04-04 Impact factor: 54.908

Review 4. The gut microbiome in neurological disorders.

Authors: John F Cryan; Kenneth J O'Riordan; Kiran Sandhu; Veronica Peterson; Timothy G Dinan
Journal: Lancet Neurol Date: 2019-11-18 Impact factor: 44.182

5. Binning metagenomic contigs by coverage and composition.

Authors: Johannes Alneberg; Brynjar Smári Bjarnason; Ino de Bruijn; Melanie Schirmer; Joshua Quick; Umer Z Ijaz; Leo Lahti; Nicholas J Loman; Anders F Andersson; Christopher Quince
Journal: Nat Methods Date: 2014-09-14 Impact factor: 28.547

6. Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments.

Authors: Erik L Clarke; Louis J Taylor; Chunyu Zhao; Andrew Connell; Jung-Jin Lee; Bryton Fett; Frederic D Bushman; Kyle Bittinger
Journal: Microbiome Date: 2019-03-22 Impact factor: 14.650

7. MAGpy: a reproducible pipeline for the downstream analysis of metagenome-assembled genomes (MAGs).

Authors: Robert D Stewart; Marc D Auffret; Timothy J Snelling; Rainer Roehe; Mick Watson
Journal: Bioinformatics Date: 2019-06-01 Impact factor: 6.937

8. Gamma-aminobutyric acid-producing lactobacilli positively affect metabolism and depressive-like behaviour in a mouse model of metabolic syndrome.

Authors: E Patterson; P M Ryan; N Wiley; I Carafa; E Sherwin; G Moloney; E Franciosi; R Mandal; D S Wishart; K Tuohy; R P Ross; J F Cryan; T G Dinan; C Stanton
Journal: Sci Rep Date: 2019-11-08 Impact factor: 4.379

9. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans.

Authors: Benjamin J Tully; Elaina D Graham; John F Heidelberg
Journal: Sci Data Date: 2018-01-16 Impact factor: 6.444

10. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies.

Authors: Dongwan D Kang; Feng Li; Edward Kirton; Ashleigh Thomas; Rob Egan; Hong An; Zhong Wang
Journal: PeerJ Date: 2019-07-26 Impact factor: 2.984

8 in total

Review 1. Ecological modelling approaches for predicting emergent properties in microbial communities.

Authors: Naomi Iris van den Berg; Daniel Machado; Sophia Santos; Isabel Rocha; Jeremy Chacón; William Harcombe; Sara Mitri; Kiran R Patil
Journal: Nat Ecol Evol Date: 2022-05-16 Impact factor: 19.100

2. The National Ecological Observatory Network's soil metagenomes: assembly and basic analysis.

Authors: Zoey R Werbin; Briana Hackos; Jorge Lopez-Nava; Michael C Dietze; Jennifer M Bhatnagar
Journal: F1000Res Date: 2021-04-19

3. KEMET - A python tool for KEGG Module evaluation and microbial genome annotation expansion.

Authors: Matteo Palù; Arianna Basile; Guido Zampieri; Laura Treu; Alessandro Rossi; Maria Silvia Morlino; Stefano Campanaro
Journal: Comput Struct Biotechnol J Date: 2022-03-26 Impact factor: 7.271

4. Plastic-Degrading Potential across the Global Microbiome Correlates with Recent Pollution Trends.

Authors: Jan Zrimec; Mariia Kokina; Sara Jonasson; Francisco Zorrilla; Aleksej Zelezniak
Journal: mBio Date: 2021-10-26 Impact factor: 7.867

5. Systems Biology of Gut Microbiota-Human Receptor Interactions: Toward Anti-inflammatory Probiotics.

Authors: Lokanand Koduru; Meiyappan Lakshmanan; Shawn Hoon; Dong-Yup Lee; Yuan Kun Lee; Dave Siak-Wei Ow
Journal: Front Microbiol Date: 2022-03-03 Impact factor: 5.640

6. The hidden simplicity of metabolic networks is revealed by multireaction dependencies.

Authors: Anika Küken; Damoun Langary; Zoran Nikoloski
Journal: Sci Adv Date: 2022-03-30 Impact factor: 14.136

7. METABOLIC: high-throughput profiling of microbial genomes for functional traits, metabolism, biogeochemistry, and community-scale functional networks.

Authors: Zhichao Zhou; Patricia Q Tran; Adam M Breister; Yang Liu; Kristopher Kieft; Elise S Cowley; Ulas Karaoz; Karthik Anantharaman
Journal: Microbiome Date: 2022-02-16 Impact factor: 14.650

8. Competition-cooperation in the chemoautotrophic ecosystem of Movile Cave: first metagenomic approach on sediments.

Authors: Iulia Chiciudean; Giancarlo Russo; Diana Felicia Bogdan; Erika Andrea Levei; Luchiana Faur; Alexandra Hillebrand-Voiculescu; Oana Teodora Moldovan; Horia Leonard Banciu
Journal: Environ Microbiome Date: 2022-08-17

8 in total