Sicheng Wu1, Chuqing Sun1, Yanze Li1, Teng Wang1, Longhao Jia1, Senying Lai1, Yaling Yang1,2, Pengyu Luo1, Die Dai1, Yong-Qing Yang3, Qibin Luo4, Na L Gao1,5, Kang Ning1,6, Li-Jie He7, Xing-Ming Zhao8,9, Wei-Hua Chen1,6,10.
Abstract
GMrepo (data repository for Gut Microbiota) is a database of curated and consistently annotated human gut metagenomes. Its main purpose is to facilitate the reusability and accessibility of the rapidly growing human metagenomic data. This is achieved by consistently annotating the microbial contents of collected samples using state-of-the-art toolsets and by manually curating the meta-data of the corresponding human hosts. GMrepo organizes the collected samples according to their associated phenotypes and includes all possible related meta-data such as age, sex, country, body-mass-index (BMI) and recent antibiotics usage. To make relevant information easier to access, GMrepo is equipped with a graphical query builder, enabling users to make customized, complex and biologically relevant queries. For example, users can find (1) samples from healthy individuals aged 18 to 25 years with BMIs between 18.5 and 24.9, or (2) projects related to colorectal neoplasms that each contain >100 samples and include both patients and healthy controls. Precomputed species/genus relative abundances, prevalence within and across phenotypes, and pairwise co-occurrence information are all available at the website and accessible through programmable interfaces. So far, GMrepo contains 58 903 human gut samples/runs (including 17 618 metagenomes and 41 285 amplicons) from 253 projects concerning 92 phenotypes. GMrepo is freely available at: https://gmrepo.humangut.info.
Year: 2020 PMID: 31504765 PMCID: PMC6943048 DOI: 10.1093/nar/gkz764
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Overall workflow of GMrepo. Processing steps are indicated in the blue rounded boxes.
Figure 2. Schematic representations of the GMrepo metagenomics pipeline for amplicon data (A) and metagenomic data (B). Processing steps are indicated in the blue rounded boxes and tools are marked on the arrows. Input and output files are shown as colored rectangles (black, green, red). Conditional judgments are in trapezoids. QC1: a run is marked as ‘failed’ (QCStatus == 0) if fewer than 20k reads or <50% of reads were retained after trimming; QC2: a run is marked as ‘failed’ (QCStatus == 0) if a single taxon accounts for >99.99% of the total abundance.
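The two QC rules in the Figure 2 caption can be sketched as a small check. This is a minimal illustration, not GMrepo's actual pipeline code: the `RunStats` class and its field names are hypothetical, the 20k threshold is assumed to apply to reads retained after trimming, and the value 1 for a passing run is an assumption (the caption only states that failed runs have QCStatus == 0).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunStats:
    """Hypothetical per-run summary; field names are illustrative only."""
    raw_reads: int                  # reads before trimming
    retained_reads: int             # reads kept after trimming
    abundances: Dict[str, float]    # taxon -> abundance (any consistent unit)


def passes_qc1(run: RunStats, min_reads: int = 20_000,
               min_retained_fraction: float = 0.5) -> bool:
    """QC1: fail if fewer than 20k reads or <50% of reads survive trimming."""
    if run.raw_reads <= 0:
        return False
    if run.retained_reads < min_reads:  # assumption: threshold applies to retained reads
        return False
    return run.retained_reads / run.raw_reads >= min_retained_fraction


def passes_qc2(run: RunStats, max_fraction: float = 0.9999) -> bool:
    """QC2: fail if a single taxon accounts for >99.99% of total abundance."""
    total = sum(run.abundances.values())
    if total <= 0:
        return False  # assumption: a run with no detected taxa is treated as failed
    return max(run.abundances.values()) / total <= max_fraction


def qc_status(run: RunStats) -> int:
    """Return 0 for a failed run; 1 for a passing run is an assumed convention."""
    return 1 if passes_qc1(run) and passes_qc2(run) else 0
```

Under these assumptions, a run with 100 000 raw reads of which 40 000 survive trimming fails QC1 on the retained fraction, even though it clears the 20k read count.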
Figure 3. Statistics of some of the metadata we collected. (A) The unknown phenotype means that the health status of the sample provider is not clearly indicated. For data from the American Gut Project (AGP), we only use diagnoses from medical professionals (doctor, physician assistant). Samples with unknown phenotypes are mainly from AGP. (B) The integrity of the metadata is assessed based on age, sex and BMI.
Top 10 phenotypes included in GMrepo
| Phenotype | No. of runs | No. of processed runs | No. of valid runs | No. of failed runs | No. of associated species | No. of associated genera |
|---|---|---|---|---|---|---|
| Health | 27 329 | 20 320 | 12 485 | 7835 | 6189 | 1613 |
| Colitis, Ulcerative | 2509 | 2440 | 1175 | 1265 | 4183 | 1285 |
| Irritable Bowel Syndrome | 2092 | 2091 | 954 | 1137 | 3320 | 1064 |
| Infant, Premature | 1443 | 1443 | 1240 | 203 | 260 | 97 |
| Colorectal Neoplasms | 1374 | 1374 | 1256 | 118 | 4596 | 1380 |
| Diarrhea | 1355 | 1354 | 470 | 884 | 2775 | 906 |
| Constipation | 1244 | 1244 | 611 | 633 | 3146 | 1022 |
| Migraine Disorders | 1235 | 1235 | 574 | 661 | 2894 | 964 |
| Lung Diseases | 1228 | 1228 | 592 | 636 | 2817 | 958 |
| Autoimmune Diseases | 1154 | 1154 | 547 | 607 | 2848 | 956 |
No. of runs: number of all runs with curated meta-data.
No. of processed runs: number of all runs whose sequence data have been processed; please note that all runs will be processed eventually.
No. of valid runs: number of runs whose data passed our QC procedure and whose species/genus relative abundances are available in our database.
No. of failed runs: number of runs whose data did not pass our QC procedure.
No. of associated species: number of species associated with the processed and valid runs.
No. of associated genera: number of genera associated with the processed and valid runs.
Figure 4. Phylogenetic tree comprising the 2685 included species, based on NCBI taxonomy. These 2685 species were found in more than one sample with a median relative abundance higher than 0.01% within one or more phenotypes. The three inner layers show the statistics of these species in our database, including the median relative abundance of the species (red) and the species prevalence in the samples (brown) and phenotypes (yellow). The outermost layer shows the corresponding phyla of these species.
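The inclusion rule in the Figure 4 caption can be illustrated with a short filter. This is only a sketch of one reading of the caption: it assumes that both the sample count and the median are evaluated per phenotype, over the samples in which the species was detected, and that abundances are given in percent. The function name and input layout are hypothetical.

```python
from statistics import median
from typing import Dict, List, Set


def included_species(
    abundance_by_phenotype: Dict[str, Dict[str, List[float]]],
    min_samples: int = 2,
    min_median_percent: float = 0.01,
) -> Set[str]:
    """Return species that satisfy the rule in at least one phenotype.

    abundance_by_phenotype maps phenotype -> species -> list of relative
    abundances (in %) from the samples where the species was detected.
    """
    keep: Set[str] = set()
    for species_map in abundance_by_phenotype.values():
        for species, values in species_map.items():
            # "more than one sample" and "median higher than 0.01%"
            if len(values) >= min_samples and median(values) > min_median_percent:
                keep.add(species)
    return keep
```

A species that is too rare in one phenotype is still kept if it clears both thresholds in another, which matches the "within one or more phenotypes" wording.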
Figure 5. Details of a species in Crohn's Disease. Faecalibacterium prausnitzii is chosen to show its distributions (A) and relative abundances (B) in Crohn's Disease. For various disease phenotypes, the relative abundances of the species of interest in healthy controls (green) are also retrieved and visualized side-by-side with those in the disease (red). (C) A species co-occurrence network constructed from the significantly co-occurring pairs for a phenotype (Crohn's Disease). Nodes: species that co-occur with others in samples of this phenotype, with sizes proportional to the number of connected nodes in the network. Links: co-occurring relationships between species, with widths proportional to the absolute value of the correlation coefficient (Pearson correlation) and colors indicating positive (green) or negative (red) correlations. Hovering the mouse over a node highlights the node and its direct neighbors and shows their names.
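The network encoding described in panel (C) can be sketched as follows: link width from the absolute Pearson coefficient, link color from its sign, and node size from the number of connected nodes. The caption does not say how significance is assessed, so the absolute-correlation cutoff `min_abs_r` below is a stand-in assumption, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cooccurrence_edges(abundances: Dict[str, List[float]],
                       min_abs_r: float = 0.5) -> List[dict]:
    """Build links between species pairs whose |r| reaches the cutoff.

    abundances maps species -> relative abundances across the same samples.
    The cutoff replaces the unspecified significance test (assumption).
    """
    edges = []
    for a, b in combinations(sorted(abundances), 2):
        r = pearson(abundances[a], abundances[b])
        if abs(r) >= min_abs_r:
            edges.append({
                "source": a,
                "target": b,
                "width": abs(r),                        # width ~ |correlation|
                "color": "green" if r > 0 else "red",   # sign of correlation
            })
    return edges


def node_degrees(edges: List[dict]) -> Dict[str, int]:
    """Number of connected nodes per species, used to scale node size."""
    degrees: Dict[str, int] = {}
    for edge in edges:
        for node in (edge["source"], edge["target"]):
            degrees[node] = degrees.get(node, 0) + 1
    return degrees
```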
Figure 6. Graphical selectors and three examples. These selectors support complex logic combinations (AND, OR and grouping) that allow users to perform biologically relevant queries. (A) shows how to find samples from healthy individuals with BMIs between 18.5 and 24.9; (B) shows how to find fecal samples of Americans who did not recently use antibiotics; (C) shows how to find projects that are related to neurological diseases (e.g. autism spectrum disorder, bipolar disorder and depression) and that each contain healthy controls.
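The AND/OR/grouping logic of the selectors can be mimicked with composable predicates over sample records. This is a minimal sketch, not GMrepo's query builder or programmable interface: the field names (`phenotype`, `bmi`, `country`, `sample_type`, `recent_antibiotics`) and the values used below are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List

Sample = Dict[str, Any]
Predicate = Callable[[Sample], bool]


def equals(field: str, value: Any) -> Predicate:
    """Match samples whose field equals the given value."""
    return lambda s: s.get(field) == value


def between(field: str, low: float, high: float) -> Predicate:
    """Match samples whose numeric field lies in [low, high]; missing values fail."""
    def check(s: Sample) -> bool:
        v = s.get(field)
        return v is not None and low <= v <= high
    return check


def all_of(*preds: Predicate) -> Predicate:
    """AND group."""
    return lambda s: all(p(s) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """OR group."""
    return lambda s: any(p(s) for p in preds)


def run_query(samples: Iterable[Sample], predicate: Predicate) -> List[Sample]:
    return [s for s in samples if predicate(s)]


# Panel (A): healthy individuals with BMI between 18.5 and 24.9.
query_a = all_of(equals("phenotype", "Health"), between("bmi", 18.5, 24.9))

# Panel (B): fecal samples from the USA without recent antibiotics use.
query_b = all_of(
    equals("sample_type", "feces"),
    equals("country", "USA"),
    equals("recent_antibiotics", False),
)

# OR group over several phenotypes (a sample-level simplification of panel C).
query_neuro = any_of(
    equals("phenotype", "Autism Spectrum Disorder"),
    equals("phenotype", "Bipolar Disorder"),
    equals("phenotype", "Depression"),
)
```

Because groups are themselves predicates, they nest freely, for example `all_of(query_neuro, between("age", 18, 25))`.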