Stephen Nayfach (1), Michael A Fischbach (2), Katherine S Pollard (1)
1. Integrative Program in Quantitative Biology, Gladstone Institutes, and Division of Biostatistics, University of California San Francisco.
2. Department of Bioengineering and Therapeutic Sciences and California Institute for Quantitative Biosciences, University of California San Francisco, San Francisco, CA, USA.
Abstract
Microbiome researchers frequently want to know how abundant a particular microbial gene or pathway is across different human hosts, including its association with disease and its co-occurrence with other genes or microbial taxa. With thousands of publicly available metagenomes, these questions should be easy to answer. However, computational barriers prevent most researchers from conducting such analyses. We address this problem with MetaQuery, a web application for rapid and quantitative analysis of specific genes in the human gut microbiome. The user inputs one or more query genes, and our software returns the estimated abundance of these genes across 1267 publicly available fecal metagenomes from American, European and Chinese individuals. In addition, our application performs downstream statistical analyses to identify features that are associated with gene variation, including other query genes (i.e. gene co-variation), taxa, clinical variables (e.g. inflammatory bowel disease and diabetes) and average genome size. The speed and accessibility of MetaQuery are a step toward democratizing metagenomics research, which should allow many researchers to query the abundance and variation of specific genes in the human gut microbiome.
AVAILABILITY AND IMPLEMENTATION: http://metaquery.docpollard.org.
CONTACT: snayfach@gmail.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
A number of large-scale shotgun metagenomics projects have been made publicly available, enabling researchers to investigate the functional composition of microbial communities from the human body and how microbial functions correlate with disease or other traits (Aagaard et al.; Li et al.). A common goal of many microbiome studies is to quantify the abundance of specific genes across these publicly available datasets. In most cases, this task involves (1) downloading metagenomes from public repositories, (2) mapping reads to a reference database and (3) estimating gene abundances. For example, this was the approach used by Donia et al. to estimate the abundance of 14 000 biosynthetic gene clusters in the human microbiome. However, this approach is time-consuming and computationally demanding, requiring large amounts of storage space and processing power, and is therefore not practical for many research groups.
In an attempt to address this issue, several microbiome studies have made their functional annotations publicly available. For example, the Human Microbiome Project (HMP) Data Analysis and Coordination Center provides the abundance of KEGG Orthology Groups across 649 metagenomes from the human microbiome. Other studies have provided similar resources for other samples and databases. While useful, these resources represent only a small proportion of available samples, and their annotations typically cover only a small fraction of genes in most metagenomes. For example, only 36% of reads from the HMP were mapped to a KEGG Orthology Group (Abubucker et al.). Furthermore, databases such as KEGG use unsupervised methods to cluster genes into ortholog groups that may not track with protein function (Schnoes et al.).
More recently, there have been several efforts to create comprehensive gene catalogs that cover a much higher proportion of genes in the gut microbiome. Most notably, Li et al. used 1267 samples from six different studies together with 511 genomes from gut microbiota to assemble a gene catalog of 9.9 million non-redundant genes. MetaQuery leverages this existing resource to let users rapidly estimate the abundance of one or more query sequences across 1267 fecal metagenomes. Instead of re-mapping metagenomic reads for each query, reads were mapped once to the gene catalog, which can then be queried many times. Our framework allows the specification of sequence homology thresholds, which enable the user to define the relationship between sequence similarity and function. Finally, we use a set of 30 universal single-copy genes to normalize gene abundances, which eliminates biases due to average genome size and database coverage (Manor and Borenstein, 2015; Nayfach and Pollard, 2015). This simple yet efficient framework has the potential to make large-scale metagenomics research accessible to a greater number of microbiome researchers.
2 Implementation
2.1 Gene abundance estimation
MetaQuery leverages the gene catalog and gene abundances published by Li et al. to rapidly estimate the abundance of one or more query genes in the human gut microbiome (Supplementary Fig. S1). First, the user submits one or more protein sequences in FASTA format, which are aligned against the gene catalog using either BLAST (Altschul et al.) or RAPsearch2 (Zhao et al.). Next, homologs of the query sequence(s) are identified in the gene catalog based on the resulting alignments and user-specified thresholds, which give the user flexibility to target either close or remote homologs of the query sequence. For each query, the abundances of the identified homologs are obtained from a precomputed matrix and summed per query and per sample. Gene abundances are then optionally normalized using the relative abundance of 30 universal single-copy genes. Finally, the gene abundance(s) are compared against a background set of queries to give the user a context in which to interpret the results.
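The steps described above (threshold-based homolog selection, summation from a precomputed matrix and marker-gene normalization) can be sketched in a few lines of Python. This is a minimal illustration, not MetaQuery's actual implementation: the function names, the dictionary layout, the default thresholds and the use of the median marker-gene abundance as the normalization denominator are all assumptions made for the example, and the toy data use 3 marker genes rather than 30.

```python
from collections import defaultdict
from statistics import median


def select_homologs(alignments, min_identity=50.0, min_coverage=0.7, max_evalue=1e-5):
    """Return {query_id: set of catalog gene ids} for hits passing all thresholds.

    The threshold names and defaults are hypothetical, chosen for illustration.
    """
    homologs = defaultdict(set)
    for hit in alignments:
        if (hit["identity"] >= min_identity
                and hit["coverage"] >= min_coverage
                and hit["evalue"] <= max_evalue):
            homologs[hit["query"]].add(hit["gene"])
    return dict(homologs)


def sum_abundances(homologs, abundance):
    """Sum precomputed abundances over each query's homologs, per sample.

    abundance maps gene id -> {sample id: abundance}; no reads are re-mapped.
    """
    totals = {}
    for query, genes in homologs.items():
        per_sample = defaultdict(float)
        for gene in genes:
            for sample, value in abundance.get(gene, {}).items():
                per_sample[sample] += value
        totals[query] = dict(per_sample)
    return totals


def normalize(totals, marker_abundance):
    """Divide by the median abundance of single-copy marker genes in each sample.

    Using the median is an assumption; the result reads as copies per cell.
    """
    return {
        query: {s: v / median(marker_abundance[s]) for s, v in per_sample.items()}
        for query, per_sample in totals.items()
    }


# Toy inputs (entirely made up): three alignments for one query, two samples.
alignments = [
    {"query": "q1", "gene": "g1", "identity": 92.0, "coverage": 0.95, "evalue": 1e-80},
    {"query": "q1", "gene": "g2", "identity": 61.0, "coverage": 0.80, "evalue": 1e-30},
    {"query": "q1", "gene": "g3", "identity": 35.0, "coverage": 0.90, "evalue": 1e-8},
]
abundance = {
    "g1": {"s1": 2.0, "s2": 0.0},
    "g2": {"s1": 1.0, "s2": 4.0},
    "g3": {"s1": 5.0, "s2": 5.0},
}
markers = {"s1": [100.0, 120.0, 110.0], "s2": [200.0, 200.0, 200.0]}

homologs = select_homologs(alignments)        # g3 fails the identity threshold
totals = sum_abundances(homologs, abundance)  # summed per query and per sample
copies_per_cell = normalize(totals, markers)
print(copies_per_cell)
```

Lowering `min_identity` would pull in the remote homolog `g3`, which illustrates how the choice of thresholds changes what is counted as the "same" gene.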
2.2 Statistical analysis
After obtaining gene abundances, MetaQuery performs a number of statistical analyses. When multiple query sequences are submitted, MetaQuery builds a Spearman correlation matrix of the query genes across microbiome samples. Gene co-variation can identify genes that are physically linked on a genome, or genes that functionally interact in a metabolic pathway or protein complex. Next, Kruskal–Wallis tests are performed to identify genes that are differentially abundant between sample groups, including host continent (i.e. North America, Europe and Asia) and host health status (e.g. inflammatory bowel disease and diabetes). Finally, MetaQuery computes Spearman correlations of gene abundance versus average genome size (Nayfach and Pollard, 2015) and versus MetaPhlAn (Segata et al.) taxonomic abundances.
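For readers who want to see what these two statistics compute, the following is a minimal standard-library sketch of Spearman correlation (Pearson correlation on tie-averaged ranks) and of the Kruskal–Wallis H statistic. It is not taken from MetaQuery: the function names and toy profiles are hypothetical, the H statistic is computed without the tie correction, and no P-values are derived.

```python
from itertools import groupby
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0  # number of items already ranked
    for _, group in groupby(order, key=lambda i: values[i]):
        idx = list(group)
        mean_rank = pos + (len(idx) + 1) / 2  # mean of pos+1 .. pos+len(idx)
        for i in idx:
            ranks[i] = mean_rank
        pos += len(idx)
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks (undefined if constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def spearman_matrix(profiles):
    """Pairwise rho between genes; profiles maps gene -> abundances per sample."""
    names = list(profiles)
    return {a: {b: spearman(profiles[a], profiles[b]) for b in names} for a in names}


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic over pooled ranks (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Toy abundance profiles across four samples (made-up numbers).
profiles = {
    "geneA": [0.1, 0.4, 0.2, 0.9],
    "geneB": [0.2, 0.5, 0.3, 1.1],   # same ordering as geneA
    "geneC": [0.9, 0.1, 0.5, 0.05],  # reverse ordering of geneA
}
rho = spearman_matrix(profiles)

# Toy abundances of one gene grouped by a hypothetical host label.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(rho["geneA"]["geneB"], rho["geneA"]["geneC"], h_stat)
```

In practice, library routines such as `scipy.stats.spearmanr` and `scipy.stats.kruskal` also return P-values and apply the tie correction, so a sketch like this is mainly useful for understanding the quantities being reported.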
3 Case study
We used MetaQuery to explore metagenomic variation of the fructan utilization locus found in Bacteroides thetaiotaomicron. This locus consists of a cluster of co-regulated genes that degrade non-digestible fructose-based polysaccharides from the human diet (Sonnenburg et al.). We found that members of the locus tended to be quite abundant and varied extensively across gut microbiome samples, with an average estimated copy number of 1 per 50 cells, which ranked in the top 2% relative to other genes in the gene catalog. The locus was most abundant in American subjects (mean = 1 copy per 38 cells) and least abundant in European individuals (mean = 1 copy per 220 cells) (Supplementary Fig. S2A). We found that the locus was marginally associated with both Crohn’s disease (P = 0.048) and diabetes (P = 0.045), suggesting a potential role of microbes capable of fructan utilization in human disease (Supplementary Fig. S2B–D). Interestingly, variation of the fructan locus was strongly correlated with both average genome size (AGS; ρ = 0.62) and the relative abundance of Bacteroides (ρ = 0.68), although even in communities with large AGS or high Bacteroides abundance, there was still large variation in the abundance of the locus (Supplementary Fig. S3). Finally, we observed that the abundances of genes BT1757-58 and BT1760-63 were strongly correlated across hosts (all ρ > 0.97), which is consistent with the fact that these genes are physically and functionally linked (Supplementary Fig. S4).
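The "1 copy per N cells" figures are reciprocals of a normalized abundance expressed in copies per cell, and a "top X%" rank is the share of background genes that are at least as abundant. A short sketch of that arithmetic follows, assuming the normalized abundance can be read as copies per cell; the helper names and the background values are hypothetical and are not MetaQuery outputs.

```python
def cells_per_copy(copies_per_cell):
    """Convert copies per cell into the N of '1 copy per N cells'."""
    return 1.0 / copies_per_cell


def top_percent(value, background):
    """Percentage of background abundances that are at least as large as value."""
    return 100.0 * sum(b >= value for b in background) / len(background)


# 0.02 copies per cell corresponds to roughly 1 copy per 50 cells.
average_n = cells_per_copy(0.02)

# Reciprocals of the continent means quoted in the text, in copies per cell.
american = 1.0 / 38
european = 1.0 / 220
fold_difference = american / european  # equals 220 / 38

# Made-up background of 100 gene abundances; only two reach 0.02 or more.
background = [0.001] * 98 + [0.02, 0.5]
rank = top_percent(0.02, background)
print(round(average_n), round(fold_difference, 1), rank)
```

Read this way, the difference between the American and European means corresponds to a roughly 5.8-fold difference in estimated copies per cell.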
4 Conclusions
MetaQuery is a web application that allows rapid and quantitative analysis of genes in the human gut microbiome. Our simple framework should enable researchers to easily investigate metagenomic variation of specific genes of interest across a large cohort of samples from the gut microbiome. Our current reference database contains genes and abundances for 1267 samples. In the future, these databases could be updated as additional fecal metagenomes become publicly available. Finally, this framework is not restricted to the human gut microbiome and could be applied to other environments, including metagenomes from soil and marine environments.
Erica D Sonnenburg; Hongjun Zheng; Payal Joglekar; Steven K Higginbottom; Susan J Firbank; David N Bolam; Justin L Sonnenburg. Cell. 2010-06-24.
Mohamed S Donia; Peter Cimermancic; Christopher J Schulze; Laura C Wieland Brown; John Martin; Makedonka Mitreva; Jon Clardy; Roger G Linington; Michael A Fischbach. Cell. 2014-09-11.
Sahar Abubucker; Nicola Segata; Johannes Goll; Alyxandria M Schubert; Jacques Izard; Brandi L Cantarel; Beltran Rodriguez-Mueller; Jeremy Zucker; Mathangi Thiagarajan; Bernard Henrissat; Owen White; Scott T Kelley; Barbara Methé; Patrick D Schloss; Dirk Gevers; Makedonka Mitreva; Curtis Huttenhower. PLoS Comput Biol. 2012-06-13.