Carlos M Duarte, David K Ngugi, Intikhab Alam, John Pearman, Allan Kamau, Victor M Eguiluz, Takashi Gojobori, Silvia G Acinas, Josep M Gasol, Vladimir Bajic, Xabier Irigoien.
Abstract
Massive metagenomic sequencing combined with gene prediction methods has previously been used to compile gene catalogues of ocean and host‐associated microbes. Global expeditions conducted over the past 15 years have sampled the ocean to build a catalogue of genes from pelagic microbes. Here we undertook a large sequencing effort of a perturbed Red Sea plankton community and found that the rate of gene discovery increases continuously with sequencing effort, with no indication that the 2.83 million non‐redundant (complete) genes predicted from the experiment represent a nearly complete inventory of the genes present in the sampled community (i.e., no evidence of saturation). The underlying reason is the Pareto‐like distribution of gene abundance in the plankton community, which produces a very long tail of millions of genes present at remarkably low abundances that can only be retrieved through massive sequencing. Microbial metagenomic projects retrieve a variable number of unique genes per terabase pair (Tbp), with a median value of 14.7 million unique genes per Tbp sequenced across projects. The increase in the rate of gene discovery in microbial metagenomes with sequencing effort implies that there is ample room for new gene discovery in further ocean and holobiont sequencing studies.
Year: 2020 PMID: 32743860 PMCID: PMC7756799 DOI: 10.1111/1462-2920.15182
Source DB: PubMed Journal: Environ Microbiol ISSN: 1462-2912 Impact factor: 5.491
Fig 1. The abundance distribution of non‐redundant genes in the Red Sea community examined here. The red line shows a power‐law decay fitted by maximum likelihood (F(S) ∼ S^(−α); R² = 0.99; P < 0.0001), with an exponent of −2.43. An exponent whose magnitude is larger than 2 suggests the prevalence of rare genes in the microbiome. Further details are summarized in Supplementary Data 1.
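As a rough illustration of what a maximum‐likelihood power‐law fit involves, the sketch below draws synthetic abundances from a continuous power law and recovers the exponent with the standard closed‐form estimator, α̂ = 1 + n / Σ ln(x / x_min). The data are simulated, the function names are hypothetical, and the continuous estimator is an assumption: the record does not state which fitting procedure or software the authors used.

```python
import math
import random


def sample_power_law(n, alpha, x_min=1.0, seed=0):
    """Draw n values from a continuous power law p(x) ~ x^(-alpha), x >= x_min."""
    rng = random.Random(seed)
    # Inverse-transform sampling: x = x_min * (1 - u)^(-1 / (alpha - 1))
    return [
        x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
        for _ in range(n)
    ]


def mle_exponent(values, x_min=1.0):
    """Continuous maximum-likelihood estimate of the power-law exponent."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


# Synthetic abundances only; 2.43 is the exponent magnitude quoted in the legend.
synthetic = sample_power_law(100_000, alpha=2.43)
alpha_hat = mle_exponent(synthetic)
print(f"estimated exponent: {alpha_hat:.3f}")
```

With a steep exponent most simulated values sit close to `x_min`, which mirrors the long tail of low‐abundance genes described in the abstract.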
Fig 2. The relationship between the cumulative sequencing effort applied to metagenome samples, retrieved over the 20 days of the experiment from the different mesocosms enclosing the same initial Red Sea plankton community, and the cumulative number of non‐redundant genes (A) as well as unique clusters of gene (B) and protein (C) sequences retrieved separately from each sample. The dotted red line shows the first‐order linear best‐fit regression.
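The accumulation curve described in this legend can be sketched as follows: samples are added one at a time, the sequenced bases and the set of distinct gene identifiers are accumulated, and a first‐order least‐squares line is fitted to the resulting points. The toy samples, gene identifiers and function names are hypothetical and only illustrate the bookkeeping; they are not the study's data or pipeline.

```python
def accumulation_curve(samples):
    """Return (cumulative_bp, cumulative_unique_genes) after each sample.

    samples: list of (sequenced_bp, set_of_gene_ids) in sampling order.
    """
    seen = set()
    total_bp = 0
    curve = []
    for bp, genes in samples:
        total_bp += bp
        seen |= set(genes)  # genes already seen do not add to the count
        curve.append((total_bp, len(seen)))
    return curve


def linear_fit(points):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data: three samples of 10 bp each with overlapping gene sets.
toy_samples = [
    (10, {"g1", "g2", "g3"}),
    (10, {"g2", "g3", "g4", "g5"}),
    (10, {"g1", "g5", "g6", "g7"}),
]
curve = accumulation_curve(toy_samples)
slope, intercept = linear_fit(curve)
print(curve, slope, intercept)
```

A curve that keeps rising linearly, as in this toy case, is what "no evidence of saturation" looks like; a saturating community would flatten as new samples stop contributing unseen genes.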
Fig 3. Redundancy of retrieved genes in different experimental perturbations. The ratio of non‐redundant complete genes (90% identity over 80% of the shorter gene's length) to the total number of predicted genes. Values are based on Supplementary Data 1.
Fig 4. Abundance distribution and diversity of gene families. A. Rank‐abundance curve of gene families assigned to KEGG orthologues (KOs). Colour symbols indicate how dominant (red) and rare (yellow) gene families follow a multimodal distribution. B. Box plots showing the average gene sequence clusters for the top five abundant KOs (in red) encompassing three broad functional groups. The whiskers denote the 10th and the 90th percentiles, while the middle horizontal line shows the median (n = 65 samples per KO). C. Alpha diversity of KOs (n = 12 516) across the treatments and the control.
Table 1. The number of non‐redundant gene sequence clusters predicted in various metagenome projects exploring marine pelagic microbial communities and mammalian enteric microbiomes, and the corresponding yield relative to the sequencing depth applied.
| Project (gene catalogue) | Analytical procedures | Samples | Sequenced depth (Tbp) | Redundant sequences (×10⁶) | Non‐redundant genes (×10⁶) | Non‐redundant proteins (×10⁶) | Gene yield (10⁶ per Tbp) | Protein yield (10⁶ per Tbp) | Original data source |
|---|---|---|---|---|---|---|---|---|---|
| Marine pelagic microbial communities | | | | | | | | | |
| Global Ocean Sampling (GOS) | GP + GC | 44 | 0.00625 | 13.6 | 4.5 | 3.9 | 720.0 | 624.0 | Rusch |
| Baltic Sea reference metagenomes | GP + GC | 81 | 0.586 | 8.7 | 8.6 | 8.3 | 14.7 | 14.2 | Hugerth |
| Tara Oceans (OM‐RGC) | GP + GC | 243 | 5.821 | 61.3 | 33.3 | 27.7 | 5.7 | 4.8 | Sunagawa |
| RSCK2011 | AM + GP + GC | 45 | 0.0483 | 2.0 | 1.3 | 1.2 | 26.9 | 24.8 | Thompson |
| Station ALOHA (HOTGC) | RAM + GP + GC | 103 | 0.638 | 47.3 | 29.6 | 26.1 | 46.4 | 40.9 | Mende |
| GEOTRACES program | GP + GC | 610 | 5.024 | 72.9 | 29.1 | 24.1 | 5.8 | 4.8 | Biller |
| MALASPINA‐Deep (MDSGC) | RAM + GP + GC | 60 | 0.121 | 11.8 | 6.7 | 6.3 | 55.4 | 52.1 | Acinas |
| MALASPINA‐profiles (MRGC) | AM + GP + GC | 116 | 1.714 | 61.6 | 32.7 | 29.0 | 19.1 | 16.9 | P. Sanchez |
| MESOCOSM | AM + GP + GC | 65 | 0.163 | 5.1 | 2.8 | 2.6 | 17.2 | 16.0 | This study |
| Mammalian enteric microbiomes | | | | | | | | | |
| Human gut microbiome I | GP + GC | 124 | 0.577 | 9.7 | 4.1 | 3.8 | 7.1 | 6.6 | Qin |
| Human gut microbiome II | GP + GC | 1267 | 6.298 | 121.3 | 18.5 | 16.0 | 2.9 | 2.5 | Li |
| Mouse gut microbiome | | 184 | 0.781 | 22.2 | 2.6 | n.d. | 3.3 | n.d. | Xiao |
| Rat gut microbiome | GP + GC | 98 | 0.222 | 26.8 | 7.6 | 6.9 | 34.2 | 31.0 | Pan |
| Pig gut microbiome | | 287 | 1.761 | 62.9 | 7.7 | n.d. | 4.4 | n.d. | Xiao |
Abbreviations: OM‐RGC, Ocean Microbial Reference Gene Catalogue; MDSGC, Malaspina Deep‐Sea Gene Collection; MRGC, Malaspina Reference Gene Catalogue; RSCK2011, Red Sea Centre Cruise 2011; HOTGC, Hawaii Ocean Time‐series Gene Catalogue; AM, assembled; RAM, re‐assembled; GP, gene prediction; GC, gene clustering with MMseqs2; n.d., not determined.
Based on Prodigal and retaining only complete genes. However, PGM and MGM are based on MetaGene since no assemblies were available.
Unless stated otherwise, all datasets were (re)analysed with the same procedures to minimize procedural artefacts.
Based on high‐quality read sequences except for the GEOTRACES program (raw sequencing depth) and GOS (total length of Sanger contigs ≥500 bp).
Defined as sequence clusters with 95% global identity over 80% of the length.
Defined as sequence clusters with 90% global identity over 80% of the length.
Reported values may differ from the original reference (when reported) since re‐analyses were done in the context of this study.
Includes protein‐coding genes from coassembly of unmapped reads. Details are provided in Supplementary Fig. S1.
Based on the assembly of data from Thompson et al. (2017) under BioProject number PRJNA289734.
Based on re‐assembly of data from Mende et al. (2017) using metaSPAdes (see Supplementary Fig. S1).
Includes time‐series data from BATS and HOT, with a total of 130 metagenomes.
Includes incomplete gene sequences (up to two‐thirds), with clusters defined at 95% identity over 90% of the length. The minimum length was 100 bp.
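The yield columns of Table 1 are consistent with dividing the number of non‐redundant sequences by the sequencing depth. The sketch below recomputes the gene yield from the table values and takes the median across projects. Leaving GOS out of the median is an assumption, taken from the Fig 5 legend's note that GOS used a different sequencing technology; under that assumption the result matches the 14.7 million genes per Tbp quoted in the abstract. The short project labels and the function name are illustrative.

```python
from statistics import median

# (project, sequenced depth in Tbp, non-redundant genes in millions), from Table 1
CATALOGUES = [
    ("GOS", 0.00625, 4.5),
    ("Baltic Sea", 0.586, 8.6),
    ("OM-RGC", 5.821, 33.3),
    ("RSCK2011", 0.0483, 1.3),
    ("Station ALOHA", 0.638, 29.6),
    ("GEOTRACES", 5.024, 29.1),
    ("MALASPINA-Deep", 0.121, 6.7),
    ("MALASPINA-profiles", 1.714, 32.7),
    ("MESOCOSM", 0.163, 2.8),
    ("Human gut I", 0.577, 4.1),
    ("Human gut II", 6.298, 18.5),
    ("Mouse gut", 0.781, 2.6),
    ("Rat gut", 0.222, 7.6),
    ("Pig gut", 1.761, 7.7),
]


def gene_yield(nonredundant_millions, depth_tbp):
    """Non-redundant genes (millions) retrieved per Tbp sequenced."""
    return nonredundant_millions / depth_tbp


yields = {name: gene_yield(genes, depth) for name, depth, genes in CATALOGUES}

# Assumption: GOS (Sanger sequencing) is excluded, as it is from Fig 5.
median_without_gos = median(v for name, v in yields.items() if name != "GOS")

for name, value in sorted(yields.items(), key=lambda item: item[1]):
    print(f"{name:20s} {value:7.1f}")
print(f"median (GOS excluded): {median_without_gos:.1f}")
```

The spread between the lowest and highest yields spans more than an order of magnitude even without GOS, which is the variability the abstract refers to.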
Fig 5. The relationship between sequencing depth and the number of non‐redundant genes or gene yield in different high‐throughput sequencing projects encompassing only marine metagenomes (A and B) or including host‐associated microbiomes (B and C). The black line indicates the log–log fit to the data (Supplementary Data 11), while the grey area shows the 95% confidence intervals for the regression curve. The distributions of the ‘x’ and ‘y’ axis values are highlighted in blue and green respectively, with corresponding medians denoted by coloured dotted lines. Black and orange circular symbols denote marine and mammalian enteric microbiomes respectively. Abbreviations: RS, Red Sea Cruise 2011; MD, Malaspina Deep; BARM, Baltic Sea reference metagenomes; MP, Malaspina Profile; HGM, human gut microbiome; PGM, pig gut microbiome; MGM, mouse gut microbiome; RGM, rat gut microbiome. Note that GOS (Table 1) was not included as it used a different sequencing technology. Additional information is provided in Supplementary Data 1–11.
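A log–log fit of the kind named in this legend can be sketched with ordinary least squares on log10‐transformed values. The example uses the sequencing depths and non‐redundant gene counts from Table 1 with GOS left out, as the legend states. It is only an illustration of the technique: the authors' own regression is reported in Supplementary Data 11 and may differ in the variables, subsets or fitting method used, and the function name here is hypothetical.

```python
import math

# (sequenced depth in Tbp, non-redundant genes in millions), Table 1 without GOS
DEPTH_AND_GENES = [
    (0.586, 8.6), (5.821, 33.3), (0.0483, 1.3), (0.638, 29.6),
    (5.024, 29.1), (0.121, 6.7), (1.714, 32.7), (0.163, 2.8),
    (0.577, 4.1), (6.298, 18.5), (0.781, 2.6), (0.222, 7.6),
    (1.761, 7.7),
]


def log_log_fit(points):
    """Least-squares fit of log10(y) = slope * log10(x) + intercept."""
    logs = [(math.log10(x), math.log10(y)) for x, y in points]
    n = len(logs)
    mean_x = sum(lx for lx, _ in logs) / n
    mean_y = sum(ly for _, ly in logs) / n
    sxx = sum((lx - mean_x) ** 2 for lx, _ in logs)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in logs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = log_log_fit(DEPTH_AND_GENES)
print(f"log10(genes) = {slope:.2f} * log10(depth) + {intercept:.2f}")
```

On log–log axes the slope is the scaling exponent relating gene count to sequencing depth, so it can be read directly as how strongly the catalogue size responds to additional sequencing.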