GMrepo v2: a curated human gut microbiome database with special focus on disease markers and cross-dataset comparison.

Die Dai¹, Jiaying Zhu¹, Chuqing Sun¹, Min Li¹, Jinxin Liu², Sicheng Wu¹, Kang Ning¹, Li-Jie He³, Xing-Ming Zhao²,⁴,⁵, Wei-Hua Chen¹,⁶.

Abstract

GMrepo (data repository for Gut Microbiota) is a database of curated and consistently annotated human gut metagenomes. Its main purposes are to increase the reusability and accessibility of human gut metagenomic data, and enable cross-project and phenotype comparisons. To achieve these goals, we performed manual curation on the meta-data and organized the datasets in a phenotype-centric manner. GMrepo v2 contains 353 projects and 71,642 runs/samples, which are significantly increased from the previous version. Among these runs/samples, 45,111 and 26,531 were obtained by 16S rRNA amplicon and whole-genome metagenomics sequencing, respectively. We also increased the number of phenotypes from 92 to 133. In addition, we introduced disease-marker identification and cross-project/phenotype comparison. We first identified disease markers between two phenotypes (e.g. health versus diseases) on a per-project basis for selected projects. We then compared the identified markers for each phenotype pair across datasets to facilitate the identification of consistent microbial markers across datasets. Finally, we provided a marker-centric view to allow users to check if a marker has different trends in different diseases. So far, GMrepo includes 592 marker taxa (350 species and 242 genera) for 47 phenotype pairs, identified from 83 selected projects. GMrepo v2 is freely available at: https://gmrepo.humangut.info.

Year: 2022 PMID: 34788838 PMCID: PMC8728112 DOI: 10.1093/nar/gkab1019

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The gut microbiota is important in maintaining the normal physiology of the host throughout life (1–4). Growing evidence suggests that microbiota disruption can affect immune function (5,6), metabolism (7,8) and energy production (9,10), and can cause various diseases (11–18). Many factors, including age (19), sex (20), body-mass-index (BMI) (21), country, environment (22), genetics (23), diet (24) and recent antibiotics usage (25), can influence the composition of gut microbial communities (26). In recent years, the study of the impact of the gut microbiota on human health has become a rapidly moving field of research and has been widely considered an exciting advancement in biomedicine (27–29). An increasing volume of human gut metagenomic data (including both 16S amplicon and shotgun metagenomics sequencing data) has been generated. Raw sequencing data are often deposited into general-purpose databases, including the European Nucleotide Archive (ENA) (30) (https://www.ebi.ac.uk/ena) and the NCBI Sequence Read Archive (SRA) (31) (https://www.ncbi.nlm.nih.gov/sra); in addition, several other public resources, including MGnify (32), gcMeta (33) and Qiita (34), have collected processed data and organized them according to the habitats from which the samples were taken, while gutMDisorder (35), GIMICA (36) and DISBIOME (37) have linked gut microbiota dysbiosis with various human diseases. These existing databases have greatly promoted data reuse. However, obstacles to the reusability and accessibility of the rapidly growing human metagenomic data remain, especially inaccurate and/or incomplete phenotype information and missing metadata; for example, our previous analysis revealed that ∼30% of gut metagenomic samples had none of the three basic metadata items (age, gender and BMI), even after several rounds of manual curation (38).
Recently, curatedMetagenomicData, a curated metagenomic data resource, became available; it provides standardized, curated human microbiome data with pre-calculated taxonomic and functional annotations (39), which will greatly facilitate data reusability and promote novel analysis of human metagenomics. However, it is available as an R package and might be difficult for non-R users; in addition, cross-project comparisons are not straightforward in curatedMetagenomicData, such as checking the prevalence of a species of interest across samples of multiple diseases, or whether a disease marker species is specific to one disease or shared by multiple diseases. In 2020, we introduced GMrepo v1 (data repository for Gut Microbiota) (38) as an online database of curated and consistently annotated human gut metagenomes to facilitate the reusability and accessibility of the growing body of human metagenomic data. We performed extensive meta-data curation for each collected run/sample and included all available related meta-data, such as age, sex, country, body-mass-index (BMI) and recent antibiotics usage. Further, we consistently annotated microbial contents by assigning the sequencing reads to taxonomic units and pre-computed species/genus relative abundances using state-of-the-art toolsets. We organized the collected samples based on their associated phenotypes and added within- and cross-phenotype statistics, including taxonomic abundances, prevalence and co-occurrences, so that researchers can easily explore the distribution of a species/genus in diseases of interest and compare it to that of healthy controls. In addition, we provided programmable access to most of the contents of GMrepo through representational state transfer (REST) application programming interfaces (APIs). GMrepo is also equipped with powerful graphical query builders that let users search the collected samples and projects more conveniently (38). In this study, we introduce an updated version of GMrepo.
In this new version, we collected more projects, runs/samples and phenotypes. Most importantly, we added disease marker identification and cross-project/phenotype comparisons of the identified markers. The main features of GMrepo v2 include: (a) identification of disease markers between two phenotypes (e.g. health versus adenoma) on a per-project basis for selected projects, (b) cross-dataset disease marker comparison to facilitate the identification of consistent microbial markers across multiple datasets of the same disease, and (c) cross-disease marker comparison to allow users to check whether microbial markers are unique to a specific disease or shared by multiple diseases, and whether they have different trends in different diseases.

DATA GENERATION

Collection of sequencing reads and manual curation of meta-data

To obtain more human gut metagenomic data, we searched recently updated projects in the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject) and publications in the NCBI PubMed database (https://pubmed.ncbi.nlm.nih.gov/) using ‘human gut microbiota’ as the keyword. Projects with clearly defined phenotype information and public raw sequencing data were collected for further analysis. The raw sequencing reads were downloaded from the NCBI SRA (Sequence Read Archive, https://www.ncbi.nlm.nih.gov/sra) (31) and EBI ENA (European Nucleotide Archive, https://www.ebi.ac.uk/ena) (30) databases, as reported for GMrepo v1 (38). Related meta-data were also downloaded using in-house PERL (version 5.30.0) and R (version 4.0.4) scripts. We then performed two rounds of manual curation on the meta-data. In the first round, meta-data were extracted and manually examined, including technical meta-data such as the sequencing platform, type of sequences obtained (i.e. 16S rRNA amplicon or whole-genome metagenomic) and number of sequences, and human host-related meta-data such as phenotypes (health or diseases), BMI, age, sex, diet, country and antibiotic usage of the associated samples/runs. In the second round, curators other than those of the first round reviewed the collected meta-data and made necessary corrections. GMrepo v2 now contains 71 642 runs/samples from 353 projects. Among these runs/samples, 45 111 and 26 531 were obtained by 16S rRNA amplicon (16S for short) and whole-genome metagenomics (mNGS for short) sequencing, respectively. This collection represents a significant increase compared to our previous version, which contained 58 903 runs/samples from 253 projects. In addition, the number of phenotypes (i.e. health and diseases) has increased from 92 to 133. The newly added diseases include COVID-19 (four projects; https://gmrepo.humangut.info/phenotypes/D000086382) and many others.
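The counts quoted above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the numbers are those stated in the text, while the dictionary layout and the `growth` helper are hypothetical names introduced here.

```python
# Counts reported in the text for GMrepo v1 and v2.
V1 = {"projects": 253, "runs": 58_903, "phenotypes": 92}
V2 = {"projects": 353, "runs": 71_642, "phenotypes": 133}
V2_RUNS_BY_TYPE = {"16S": 45_111, "mNGS": 26_531}


def growth(old, new):
    """Return (absolute increase, percent increase) between two counts."""
    delta = new - old
    return delta, 100.0 * delta / old


# The two sequencing types should add up to the reported v2 total.
assert sum(V2_RUNS_BY_TYPE.values()) == V2["runs"]

for key in V1:
    delta, pct = growth(V1[key], V2[key])
    print(f"{key}: {V1[key]} -> {V2[key]} (+{delta}, {pct:.1f}%)")
```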

Taxonomic assignment and relative abundances calculation

The newly collected raw sequencing data were processed as reported in the previous version (38). In short, we first evaluated the overall quality of the downloaded data using FastQC (version 0.11.8, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), followed by removing low-quality bases and sequencing vectors using Trimmomatic with default parameters (40). We then annotated microbial contents and calculated relative abundances. For 16S sequences, QIIME2 (41) was used to analyze the clean data and assign taxonomic classifications to the reads; ASV (Amplicon Sequence Variant) rather than OTU (Operational Taxonomic Unit) results were used, as the former provide a more precise measurement of sequence variation and make it easier to compare sequences between different studies (tractability and reproducibility) (41). The DADA2 (version 1.18.0) (42) pipeline implemented in QIIME2 was used to filter the sequencing reads and construct the ASV table without any sequence clustering. Deblur (43) was used for denoising and chimera removal. The taxonomy classification database used was Greengenes version 13.8. Relative abundances were then calculated for each sample at both species and genus levels, with abundances totaling 100% at each level. For whole-genome metagenomic sequences, MetaPhlAn2 (44) was applied with default parameters to assign taxonomy to the sequencing reads. We again calculated the relative abundances at species and genus levels, with abundances totaling 100% at each level.
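The normalization step described above (abundances totaling 100% at each taxonomic level) can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name and the read counts are hypothetical, and the sketch assumes that per-taxon counts are already available for each rank.

```python
def relative_abundances(counts):
    """Convert per-taxon read counts into percentages that total 100."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("sample has no assigned reads")
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Hypothetical read counts for one sample, kept separately per rank so
# that each rank is normalized to 100% on its own.
species_counts = {
    "Fusobacterium nucleatum": 30,
    "Prevotella copri": 120,
    "Parvimonas micra": 50,
}
genus_counts = {"Fusobacterium": 30, "Prevotella": 120, "Parvimonas": 50}

species_pct = relative_abundances(species_counts)
genus_pct = relative_abundances(genus_counts)
```

Normalizing each rank independently is what allows both the species-level and the genus-level profiles to sum to 100% for the same sample.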

DISEASE MARKER IDENTIFICATION AND CROSS-DATASET COMPARISON

One of the main features that distinguish GMrepo from other metagenomic databases is the cross-dataset comparison. To bring this to the next level, we introduced disease-marker identification for selected high-quality projects and provided tools to facilitate cross-project/phenotype comparisons of the identified markers.

Identification of disease markers between two phenotypes

To better understand the relationships between gut microbiota dysbiosis and human diseases, we performed in-depth analysis to identify differential bacteria and disease markers between two phenotypes (e.g., health versus adenoma) on a per-project basis for selected projects, especially those with high-quality data. To obtain reliable results, we selected projects that met all of the following criteria: (i) the project must include at least two phenotypes, very often a disease phenotype and healthy controls (for projects with different disease stages, we also compared the samples between stages); (ii) the number of samples must be >20. Furthermore, manual curation was performed for selected projects in order to: (a) select usable runs with clearly defined phenotype(s), (b) merge multiple runs if they correspond to the same sample, (c) calculate taxon abundances on a per-sample basis instead of a per-run basis, (d) group samples according to their corresponding phenotypes and (e) identify marker taxa between a pair of phenotypes of interest, e.g., health versus colorectal cancer (CRC). Here, a ‘marker taxon’ refers to a species or genus whose relative abundances showed significant differences between phenotypes. In GMrepo, marker taxa were identified using LEfSe (Linear discriminant analysis Effect Size) analysis (45) implemented in the ‘microbiomeMarker’ package (version 0.0.1.9000) of R (version 4.0.4). Linear discriminant analysis (LDA) scores were used to describe the extent of the differences, with larger values indicating more pronounced differences. An LDA score of 2 was used as the cutoff for marker taxa. Markers were identified on a per-dataset basis in order to control for project-specific confounding factors such as DNA extraction methods and sequencing platforms.
For 16S rDNA sequencing datasets, genus level markers were identified; for whole-genome metagenomic datasets (also known as metagenomic next-generation sequencing datasets, mNGS for short), both species and genus level markers were identified. Since our marker identification was on a per-project basis, for each project we added bar plots to visualize the marker taxa between two phenotypes. Shown in Figure 1 are the marker species identified between Health and Adenoma for BioProject PRJEB6070. For mNGS datasets like PRJEB6070, genus level markers are also available; researchers can choose to show markers at either species or genus level, or both levels together (see Figure 1 for more details; see also https://gmrepo.humangut.info/data/project/PRJEB6070/D006262/D000236). BioProject PRJEB6070 contained samples of three phenotypes, namely health, adenoma and CRC. Thus, in addition to markers between health and adenoma, we also identified marker taxa for health versus CRC, and adenoma versus CRC, respectively. See https://gmrepo.humangut.info/data/project/PRJEB6070 for more details. Note that the newly added marker identification results, the phenotype pairs and their corresponding runs and groups can be found in the ‘in-depth analysis’ section of the webpage.
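The curation steps and the cutoff described in this subsection can be sketched in a few helper functions. This is a hedged illustration only: it does not reimplement LEfSe, it assumes LDA scores have already been computed elsewhere, and all function names, record layouts and the reading of criterion (ii) as a total sample count are assumptions made for the example.

```python
from collections import defaultdict

LDA_CUTOFF = 2.0   # cutoff stated in the text
MIN_SAMPLES = 20   # the text requires >20 samples (assumed: per project)


def merge_runs(runs):
    """Sum per-taxon counts of runs that belong to the same sample."""
    samples = defaultdict(lambda: defaultdict(float))
    for run in runs:
        for taxon, count in run["counts"].items():
            samples[run["sample_id"]][taxon] += count
    return {sid: dict(taxa) for sid, taxa in samples.items()}


def project_is_eligible(sample_phenotypes):
    """Criteria (i) and (ii): at least two phenotypes and >20 samples."""
    n_phenotypes = len(set(sample_phenotypes.values()))
    return n_phenotypes >= 2 and len(sample_phenotypes) > MIN_SAMPLES


def select_markers(lda_scores, cutoff=LDA_CUTOFF):
    """Keep taxa whose absolute (signed) LDA score exceeds the cutoff."""
    return {t: s for t, s in lda_scores.items() if abs(s) > cutoff}


# Hypothetical runs: two runs of sample S1 are merged before analysis.
runs = [
    {"sample_id": "S1", "counts": {"A": 1, "B": 2}},
    {"sample_id": "S1", "counts": {"A": 3}},
    {"sample_id": "S2", "counts": {"B": 5}},
]
merged = merge_runs(runs)
```

Merging runs before computing abundances keeps one profile per sample, so that a sample sequenced in several runs is not counted more than once when groups are compared.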

Figure 1.

Disease markers identified between two phenotypes in a project. Here data from BioProject PRJEB6070 are used as an example; health and disease (adenoma) enriched species are plotted in green and pink respectively. The markers were identified using LEfSe. LDA (linear discriminant analysis) scores (X-axis) were used to show the extent of their enrichment. For whole-genome metagenomic datasets like PRJEB6070, genus level markers were also identified. Users can use the widgets (blue buttons) to choose the markers to show. For 16S rRNA datasets, only genus level markers were identified; thus, the ‘Species’ button will be unclickable.

So far, GMrepo includes 592 marker taxa (350 species and 242 genera) for 47 phenotype pairs, identified from 83 selected projects; more projects will be analyzed in the future. The detailed information on these marker taxa is listed at https://gmrepo.humangut.info/taxon/markertaxa. Additional links to the NCBI taxonomy (46), ENA taxonomy and NCBI MeSH Browser are also provided for each of the marker taxa, to help researchers obtain more information. More external databases will be included in the future.

Cross-dataset disease marker comparison

In GMrepo, a disease could be covered by multiple datasets/projects. A recent meta-analysis on CRC-associated gut microbiome projects suggested that disease-related microbial markers were not always consistent across studies/projects (47). Thus, to facilitate cross-project comparisons of the markers identified within each project, we added a dedicated page for each phenotype pair (e.g. health versus liver cirrhosis, or adenoma versus colorectal cancer) to systematically show the consistent and non-consistent disease-associated microbial markers across datasets. Shown in Figure 2 are two typical examples. Figure 2A shows the biomarkers between health and CRC across seven datasets (see also https://gmrepo.humangut.info/phenotypes/comparisons/D006262/D015179 for details). We observed consistent disease-associated microbial markers across the seven projects; for example, known marker species of CRC including Fusobacterium nucleatum (48–52), Parvimonas micra (53,54) and Gemella morbillorum (55) were identified to be enriched in CRC patients in most studies. However, in the phenotype comparison of health and ‘arthritis, rheumatoid’, we observed non-consistent disease-associated microbial markers (Figure 2B, see also https://gmrepo.humangut.info/phenotypes/comparisons/D006262/D001172 for details). So far, GMrepo includes 47 phenotype comparisons; see https://gmrepo.humangut.info/phenotypes/comparisons for a complete list.
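The page addresses quoted in this article follow a regular pattern keyed by project accession, phenotype identifiers and taxon identifier. The sketch below rebuilds them from those parts; the function names are hypothetical, the patterns are inferred only from the URLs shown in the text, and the taxon identifiers are assumed to be NCBI Taxonomy IDs.

```python
BASE_URL = "https://gmrepo.humangut.info"


def phenotype_comparison_url(mesh_a, mesh_b):
    """Cross-dataset comparison page for one phenotype pair."""
    return f"{BASE_URL}/phenotypes/comparisons/{mesh_a}/{mesh_b}"


def project_marker_url(project_id, mesh_a, mesh_b):
    """Per-project marker page for one phenotype pair."""
    return f"{BASE_URL}/data/project/{project_id}/{mesh_a}/{mesh_b}"


def taxon_url(taxon_id):
    """Marker-centric page for one taxon (assumed NCBI Taxonomy ID)."""
    return f"{BASE_URL}/taxon/{taxon_id}"


# Identifiers as they appear in the URLs quoted in the text.
HEALTH, ADENOMA, CRC = "D006262", "D000236", "D015179"

print(phenotype_comparison_url(HEALTH, CRC))
print(project_marker_url("PRJEB6070", HEALTH, ADENOMA))
print(taxon_url(851))
```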

Figure 2.

Cross-study comparison of microbial markers. (A) Comparison of marker species for colorectal cancer in seven metagenomic projects. (B) Comparison of marker species for ‘arthritis, rheumatoid’ in two projects. Marker taxa with LDA <−2 are health enriched, while those with LDA >2 are disease enriched. Health and disease enriched markers are shown in green and red respectively, with deeper color indicating stronger enrichment. To help users explore the markers, a few widgets are included that allow users to 1) filter markers according to the number of projects in which they are identified, 2) filter markers according to the absolute LDA scores, 3) exclude markers that show inconsistent trends among projects (e.g. those that are significantly decreased in disease in one project but significantly increased in others) and 4) change the size of the tiles. Users can also save the resulting visualization in SVG or PNG format. Please consult https://gmrepo.humangut.info/phenotypes/comparisons/D006262/D015179 and https://gmrepo.humangut.info/phenotypes/comparisons/D006262/D001172 for the interactive versions on our website; for the second link, please change the value of the ‘NR.PROJECTS (> = ):’ widget on the webpage to ‘1’ in order to show the markers.
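The first three filters described in the caption can be expressed as a small function over per-project marker records. This is a sketch under stated assumptions, not the website's implementation: the function name, the record layout, the project labels P1 and P2 and the scores are hypothetical, and signed LDA scores follow the convention in the caption (negative for health enriched, positive for disease enriched).

```python
from collections import defaultdict


def compare_markers(records, min_projects=2, min_abs_lda=2.0,
                    drop_inconsistent=False):
    """Group (project, taxon, signed_lda) records by taxon and filter them."""
    by_taxon = defaultdict(dict)
    for project, taxon, lda in records:
        if abs(lda) >= min_abs_lda:          # filter 2: absolute LDA score
            by_taxon[taxon][project] = lda

    result = {}
    for taxon, scores in by_taxon.items():
        if len(scores) < min_projects:       # filter 1: number of projects
            continue
        directions = {lda > 0 for lda in scores.values()}
        consistent = len(directions) == 1
        if drop_inconsistent and not consistent:   # filter 3: trends
            continue
        result[taxon] = {"projects": scores, "consistent": consistent}
    return result


# Hypothetical records for illustration only; not values from GMrepo.
records = [
    ("P1", "Fusobacterium nucleatum", 3.2),
    ("P2", "Fusobacterium nucleatum", 2.8),
    ("P1", "Prevotella copri", -2.4),
    ("P2", "Prevotella copri", 2.6),
    ("P1", "Gemella morbillorum", 2.1),
]
all_markers = compare_markers(records)
consistent_only = compare_markers(records, drop_inconsistent=True)
```

Setting `min_projects=1` mirrors the caption's advice for comparisons covered by only a few projects, where requiring agreement across two or more projects would hide every marker.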

Cross-disease marker comparison

We also provided a marker-centric view to allow users to check whether a microbial marker is unique to a specific disease or shared by multiple diseases, and whether it has different trends in different diseases. Taking F. nucleatum as an example, it has been identified as a marker species in eight phenotype comparisons and showed consistent trends as a disease-enriched marker (Figure 3A, see also https://gmrepo.humangut.info/taxon/851). In addition to being a CRC marker, F. nucleatum is also associated with multiple diseases in GMrepo v2, including Cardiovascular Disease, Inflammatory Bowel Diseases, Liver Cirrhosis and COVID-19. Interestingly, although F. nucleatum was enriched in CRC samples in the adenoma versus CRC comparison (Figure 3A), it was not enriched in the adenoma samples as compared with the healthy controls (see also https://gmrepo.humangut.info/phenotypes/comparisons/D006262/D000236), suggesting that it appears at the later stages of CRC (and possibly of other diseases). These results are consistent with recent publications reporting that F. nucleatum is not a marker in gut microbiota-based adenoma diagnostic models (56,57). Conversely, Prevotella copri was found to have inconsistent trends between phenotype pairs (Figure 3B and also https://gmrepo.humangut.info/taxon/165179) and even between projects of the same phenotype comparison. P. copri was reported to be associated with gut microbial enterotypes whose abundances could be affected by diet, age and gender (58,59). The inconsistent trends may indicate either undetected biases between disease and control groups in the related datasets, or that an equilibrium state for P. copri in the gut should be maintained.

Figure 3.

Cross-disease comparison of marker taxa. (A) Enrichment trends of Fusobacterium nucleatum across diseases and projects. (B) A marker-centric view of Prevotella copri across diseases and projects. Please consult https://gmrepo.humangut.info/taxon/851 and https://gmrepo.humangut.info/taxon/165179 for the online versions.
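The marker-centric view can be thought of as a summary, for one taxon, of the phenotype pairs in which it is a marker and of the direction of its enrichment. The sketch below illustrates that idea only; the function name, the record layout, the project labels and the scores are hypothetical and do not reproduce GMrepo data.

```python
def marker_centric_view(taxon, records):
    """Summarize one taxon over (taxon, phenotype_pair, project, signed_lda)."""
    trends = {}
    for rec_taxon, pair, _project, lda in records:
        if rec_taxon != taxon:
            continue
        direction = "disease-enriched" if lda > 0 else "control-enriched"
        trends.setdefault(pair, set()).add(direction)

    directions = set().union(*trends.values())
    return {
        "phenotype_pairs": sorted(trends),
        "shared": len(trends) > 1,            # marker in more than one pair
        "consistent": len(directions) <= 1,   # same direction everywhere
    }


# Hypothetical records for illustration only.
records = [
    ("Fusobacterium nucleatum", ("Health", "CRC"), "P1", 3.0),
    ("Fusobacterium nucleatum", ("Adenoma", "CRC"), "P2", 2.5),
    ("Prevotella copri", ("Health", "CRC"), "P1", -2.2),
    ("Prevotella copri", ("Health", "CRC"), "P3", 2.4),
]
fn_view = marker_centric_view("Fusobacterium nucleatum", records)
pc_view = marker_centric_view("Prevotella copri", records)
```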

Future directions

In addition to continuously adding new human gut metagenomic data to GMrepo in the future, we plan to add new contents to GMrepo, including (but not limited to) functional profiles and metabolic pathway profiles for the collected samples. It is also necessary to re-analyze all data with the latest version of the tools, or use new tools that become available in the future. In addition, we plan to include genomic sequences for the identified species, especially those directly assembled from human gut metagenome datasets (60). These will further facilitate the reusability and accessibility of human gut metagenomic data and will contribute to better understanding of the relationships between gut microbiota dysbiosis and human diseases.

CONCLUSIONS

In this study, we introduced GMrepo v2, an updated version of the online database of curated, consistently annotated meta-data and human gut metagenomic data. Updates since the last version include increased numbers of projects, samples/runs and phenotypes, obtained through multiple rounds of extensive manual curation of the meta-data. One of the main features that distinguish GMrepo from other metagenomic databases is cross-dataset comparison. To bring this to the next level, we introduced disease-marker identification and performed cross-project/phenotype comparisons, including: (i) identification of disease markers between two phenotypes on a per-project basis for selected projects, especially those with high-quality data; (ii) cross-dataset disease marker comparison to facilitate the identification of consistent microbial markers across datasets; (iii) cross-disease marker comparison to provide a marker-centric view that allows users to check whether microbial markers have different trends in different diseases. So far, GMrepo includes 592 marker taxa (350 species and 242 genera) for 47 phenotype pairs, identified from 83 selected projects; more projects will be analyzed in the future. We expect GMrepo v2 to be a highly useful and important database for biologists and bioinformaticians studying the gut microbiome. In the future, we aim to update GMrepo regularly to provide up-to-date contents and include more functionalities.

DATA AVAILABILITY

All data are freely accessible to all academic users. This work is licensed under a Creative Commons Attribution Non-Commercial 3.0 Unported License (CC BY-NC 3.0). Users can download datasets from the ‘Data downloads’ section of the ‘Help’ page. Users can also download individual datasets or combined datasets for individual projects/phenotypes/species via the ‘Browse’ page. We also provide programmable access through REST APIs; detailed instructions for obtaining our datasets using R, Perl and Python are available in the ‘Programmable access’ section of the ‘Help’ page and on our GitHub page: https://github.com/evolgeniusteam/GMrepoProgrammableAccess.
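As a rough illustration of what programmable access over REST might look like from Python, the sketch below only builds a JSON POST request and does not send it. Everything specific in it is an assumption rather than documented behaviour: the `/api` base path, the endpoint name `getAssociatedRunsByPhenotypeMeshID`, the `mesh_id` payload key and the `build_request` helper are placeholders, and the authoritative endpoint list is the one on the ‘Help’ page and the GitHub repository.

```python
import json
import urllib.request

# Assumed base path; check the 'Programmable access' section of the
# 'Help' page for the actual API location and endpoint names.
API_BASE = "https://gmrepo.humangut.info/api"


def build_request(endpoint, payload):
    """Build (but do not send) a JSON POST request to a REST endpoint."""
    return urllib.request.Request(
        url=f"{API_BASE}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint name and payload key, used for illustration only.
request = build_request("getAssociatedRunsByPhenotypeMeshID",
                        {"mesh_id": "D006262"})

# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request) as response:
#       data = json.load(response)
```

Separating request construction from sending keeps the example runnable offline and makes it easy to swap in the endpoint names given in the official instructions.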

60 in total

1. Diet, gut microbiota and immune responses.

Authors: Kendle M Maslowski; Charles R Mackay
Journal: Nat Immunol Date: 2011-01 Impact factor: 25.606

2. Treatment regimens may compromise gut-microbiome-derived signatures for liver cirrhosis.

Authors: Sicheng Wu; Puzi Jiang; Xing-Ming Zhao; Wei-Hua Chen
Journal: Cell Metab Date: 2021-03-02 Impact factor: 27.287

Review 3. Gut microbiota in human metabolic health and disease.

Authors: Yong Fan; Oluf Pedersen
Journal: Nat Rev Microbiol Date: 2020-09-04 Impact factor: 60.633

4. Enterotypes of the human gut microbiome.

Authors: Manimozhiyan Arumugam; Jeroen Raes; Eric Pelletier; Denis Le Paslier; Takuji Yamada; Daniel R Mende; Gabriel R Fernandes; Julien Tap; Thomas Bruls; Jean-Michel Batto; Marcelo Bertalan; Natalia Borruel; Francesc Casellas; Leyden Fernandez; Laurent Gautier; Torben Hansen; Masahira Hattori; Tetsuya Hayashi; Michiel Kleerebezem; Ken Kurokawa; Marion Leclerc; Florence Levenez; Chaysavanh Manichanh; H Bjørn Nielsen; Trine Nielsen; Nicolas Pons; Julie Poulain; Junjie Qin; Thomas Sicheritz-Ponten; Sebastian Tims; David Torrents; Edgardo Ugarte; Erwin G Zoetendal; Jun Wang; Francisco Guarner; Oluf Pedersen; Willem M de Vos; Søren Brunak; Joel Doré; María Antolín; François Artiguenave; Hervé M Blottiere; Mathieu Almeida; Christian Brechot; Carlos Cara; Christian Chervaux; Antonella Cultrone; Christine Delorme; Gérard Denariaz; Rozenn Dervyn; Konrad U Foerstner; Carsten Friss; Maarten van de Guchte; Eric Guedon; Florence Haimet; Wolfgang Huber; Johan van Hylckama-Vlieg; Alexandre Jamet; Catherine Juste; Ghalia Kaci; Jan Knol; Omar Lakhdari; Severine Layec; Karine Le Roux; Emmanuelle Maguin; Alexandre Mérieux; Raquel Melo Minardi; Christine M'rini; Jean Muller; Raish Oozeer; Julian Parkhill; Pierre Renault; Maria Rescigno; Nicolas Sanchez; Shinichi Sunagawa; Antonio Torrejon; Keith Turner; Gaetana Vandemeulebrouck; Encarna Varela; Yohanan Winogradsky; Georg Zeller; Jean Weissenbach; S Dusko Ehrlich; Peer Bork
Journal: Nature Date: 2011-04-20 Impact factor: 49.962

5. Body Mass Index Differences in the Gut Microbiota Are Gender Specific.

6. A novel faecal Lachnoclostridium marker for the non-invasive diagnosis of colorectal adenoma and cancer.

7. Parvimonas micra as a putative non-invasive faecal biomarker for colorectal cancer.

Review 8. Influence of Mediterranean Diet on Human Gut Microbiota.

Review 9. Introduction to the human gut microbiota.

10. gutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions.
