Christophe Djemiel, Pierre-Alain Maron, Sébastien Terrat, Samuel Dequiedt, Aurélien Cottin, Lionel Ranjard.
Abstract
Deciphering microbiota functions is crucial to predict ecosystem sustainability in response to global change. High-throughput sequencing at the individual or community level has revolutionized our understanding of microbial ecology, leading to the big data era and improving our ability to link microbial diversity with microbial functions. Recent advances in bioinformatics have been key for developing functional prediction tools based on DNA metabarcoding data and using taxonomic gene information. This approach, cheaper in every respect, serves as an alternative to shotgun sequencing. Although these tools are increasingly used by ecologists, an objective evaluation of their modularity, portability, and robustness is lacking. Here, we reviewed 100 scientific papers on functional inference and ecological trait assignment to rank the advantages, specificities, and drawbacks of these tools, using scientific benchmarking. To date, inference tools have been mainly devoted to bacterial functions, and ecological trait assignment tools to fungal functions. A major limitation is the lack of reference genomes (compared with the human microbiota), especially for complex ecosystems such as soils. Finally, we explore applied research prospects. These tools are promising and already provide relevant information on ecosystem functioning, but the standardized indicators and corresponding repositories that would enable their use for operational diagnosis are still lacking.
Keywords: ecological traits; functional inference; metabarcoding; microbiota; soil; taxonomy
Year: 2022 PMID: 35022702 PMCID: PMC8756179 DOI: 10.1093/gigascience/giab090
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Schematic diagram of the various strategies available for exploring the functional diversity of the microbiota. Green frames indicate metabarcoding approaches for retrieving putative functions from taxonomic genes by functional inference and ecological trait assignment. cDNA: complementary DNA; NMR: nuclear magnetic resonance; rRNA: ribosomal RNA.
Figure 2: Evolution of the cost (in dollars) per raw megabase of DNA sequence (black line, logarithmic scale) and of the number of SRA metabarcoding datasets deposited on the NCBI website. The data used to draw this figure are described in Additional File 1, section Figure 2.
Figure 3: Annual cumulative growth of databases in terms of bacterial/archaeal (A) and fungal (B) sequences, and species/subspecies deposited per year. Annual cumulative growth of bacterial/archaeal (C) and fungal (D) genomes compared with simulations of Moore's law. The plot is on a logarithmic scale. Three databases were compared for 16S rRNA gene sequences: RDP (blue), SILVA (orange), and Greengenes (green). Information is based on the List of Prokaryotic names with Standing in Nomenclature (LPSN [34, 35]) website for bacterial and archaeal species, and on the MycoBank database for fungal species [36, 37]. Information about the bacterial, archaeal, and fungal genomes is based on the Genome OnLine Database (GOLD) [38].
Numbers of organisms, genes, enzymes, and metabolic pathways available in the CAZy, KEGG, and MetaCyc databases
| Database | Organisms | Metabolic pathways | Enzymes/Genes |
|---|---|---|---|
| CAZy | Eukaryotes: 344; Bacteria: 20,421; Archaea: 413 | NA | GH: 171; GT: 114; PL: 41; CE: 19; AA: 16 |
| KEGG | Eukaryotes: 557; Bacteria: 6,317; Archaea: 344 | 547 | KO groups: 24,402 |
| MetaCyc | Total: 3,295 | 2,937 | 13,356 |
When possible, we detailed the number of organisms for the 3 domains of the tree of life. CAZy includes glycoside hydrolases (GH), glycosyl transferases (GT), carbohydrate esterases (CE), polysaccharide lyases (PL), and auxiliary activities (AA). CAZy: Carbohydrate-Active Enzymes; KEGG: Kyoto Encyclopedia of Genes and Genomes; KO: KEGG Orthology; MetaCyc: metabolic pathways and enzymes; NA: not applicable.
Figure 4: Global microbial gene catalogs from various ecosystems. The references are listed in Additional File 1.
Figure 5: Diagram of the granularity of the data (A) that can be obtained by functional inference (B) or ecological trait assignment (C).
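
The functional inference route named in Figure 5 can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the OTU counts, the 16S rRNA copy numbers, the gene content values, and the function name `infer_functions` are hypothetical and do not reproduce the algorithm of any specific tool. It only shows the general idea of weighting predicted gene content by copy-number-corrected OTU abundance.

```python
# Hypothetical toy data: 2 samples x 3 OTUs (read counts from metabarcoding).
otu_counts = [
    [10.0, 0.0, 6.0],
    [4.0, 8.0, 0.0],
]

# Assumed 16S rRNA gene copy number per genome for each OTU.
copy_numbers = [2.0, 4.0, 1.0]

# Assumed predicted gene content: 3 OTUs x 2 gene families (copies per genome).
gene_content = [
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 1.0],
]


def infer_functions(counts, copies, content):
    """Return a samples x gene-families table of predicted gene abundances."""
    n_genes = len(content[0])
    table = []
    for sample in counts:
        # Dividing read counts by marker copy number estimates organism abundance.
        organisms = [reads / cn for reads, cn in zip(sample, copies)]
        # Weight each OTU's predicted gene content by its estimated abundance.
        row = [
            sum(org * content[i][g] for i, org in enumerate(organisms))
            for g in range(n_genes)
        ]
        table.append(row)
    return table


predicted = infer_functions(otu_counts, copy_numbers, gene_content)
print(predicted)
```

Ecological trait assignment differs in that it attaches a categorical trait to each taxon by lookup rather than summing predicted gene copies.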
Figure 6: Timeline depicting the historical record of the major tools developed for functional inference or ecological trait assignment. The first version of the DEEMY database dates back to 1996; it was omitted for aesthetic reasons.
List of the functional inference tools, ecological trait assignment tools, and databases
| Tool | Implementation | Targeted genes | Functional prediction | Approaches | Methods | Inputs used | Strengths and specificities | Limitations |
|---|---|---|---|---|---|---|---|---|
| PanFP | Perl (recently Python) | 16S rRNA | KO, Gene Ontology, Pfam, TIGRFAM | Functional inference | Builds a pangenome | NCBI taxonomy | Uses the functional profile of the pangenome, so could be less sensitive to horizontal gene transfer | Evolutionary models are not taken into account; no confidence score generated; not yet available for microbial eukaryotes |
| PAPRICA | Python | 16S/18S rRNA | MetaCyc ontology | Functional inference | Phylogenetic placement | Based on rDNA amplicon sequences | 18S rRNA amplicons are taken into account; examples on the developer's blog | Errors may occur with sequence placement due to poor resolution of rRNA amplicons in some clades |
| PICRUSt | Python | 16S rRNA | KO, KEGG Pathway, COG, CAZy | Functional inference | ASR (Wagner Parsimony, ACE ML, ACE REML, ACE PIC) | Greengenes taxonomy (18may2012 or v13.5/v13.8) | Evolutionary models are taken into account; confidence score generated (NSTI); correction of OTU copy numbers | Based on specific taxonomy (Greengenes identifiers); KEGG database not updated since 2011; no pre-calculated table of fungal genomes available |
| PICRUSt2 | Python/R | 16S/18S rRNA/ITS | MetaCyc, KO, EC number, COG, Pfam, TIGRFAM | Functional inference | HSP (maximum parsimony, empirical probabilities, subtree averaging, SCP) | Based on rDNA amplicon sequences | Evolutionary models are taken into account; confidence score generated (NSTI); twice as many KO scores; multiple HSP methods can be implemented (takes branch length weighting into account); 18S rRNA and ITS amplicons are taken into account; extensive documentation and active community | Errors may occur with sequence placement owing to poor resolution of rRNA amplicons in some clades |
| Piphillin | Web-based | 16S rRNA | BioCyc, KEGG | Functional inference | Nearest-neighbor matching of 16S rRNA gene amplicons with genomes from reference databases | Based on rDNA amplicon sequences | Regular updates of functional databases; rRNA copy number adjustment | Available online only; available for 16S rRNA only |
| SINAPS | USEARCH | 16S rRNA | Trait annotation (e.g., energy metabolism, Gram-positive staining, presence of a flagellum) | Functional inference | Word counting | Greengenes, SILVA | Confidence is estimated by bootstrapping; integrated into the USEARCH tool | No peer-reviewed publication (bioRxiv preprint); detailed explanation is missing (e.g., how was protrait input created?) |
| Tax4Fun | R package | 16S rRNA | KO | Functional inference | Nearest-neighbor search based on a minimum 16S rRNA sequence similarity | SILVA taxonomy | Uses R (multiplatform) with pre-calculated files; confidence score generated (FTU and FSU); thanks to a minimum similarity between sequences, the algorithm could predict poorly characterized taxa better than ASR-based approaches, which may involve large distances in the tree | Based on specific taxonomy (SILVA identifiers); KEGG database not updated since 2011 |
| Tax4Fun2 | R package | 16S rRNA | KO | Functional inference | BLAST | Based on rDNA amplicon sequences | Algorithm with a minimal sequence similarity; uses R (multiplatform) with pre-calculated, highly memory-efficient, platform-independent files; confidence score generated (FTU and FSU); KO update from 2018; calculates the redundancy of specific functions directly; builds its own habitat-specific reference | Not yet available for microbial eukaryotes |
| Vikodak | Web-based (no longer available) | 16S rRNA | KEGG pathway, EC number | Functional inference | Microbial co-existence patterns | RDP, SILVA | Pathway exclusion cut-off value is available to set the minimum percentage of genes/enzymes belonging to a metabolic pathway required to consider the pathway as functional; compares 2 datasets | No longer available; not yet available for microbial eukaryotes |
| iVikodak | Web-based | 16S rRNA | KEGG, Pfam, COG, TIGRfam | Functional inference | Microbial co-inhabitance patterns | RDP, Greengenes, SILVA | User-friendly for non-expert bioinformaticians; integrated tools for statistical comparisons; graphical visualizations | Available online only; not yet available for microbial eukaryotes |
| FUNGuild | Python/Web-based | ITS | Guild type | Trait assignment | Not applicable | Based on UNITE taxonomy (ITS) | Trait quality for taxon assignment | No regular update; 18S rRNA taxonomy with related database not included. However, the database is open-access, and a homemade wrapper can be used for 18S metabarcoding output |
| FAPROTAX | Python; flat file | 16S rRNA | Ecological functions (e.g., nitrification, denitrification, or fermentation) | Trait assignment, Database | If all type strains of a species at the genus level share the function, FAPROTAX assumes that all uncultured organisms of this genus possess the putative function | SILVA (128, 132) | Based on the literature of cultured taxa; availability of all literature to create the database; functions easily added to the tool | Implicit assumption (see Methods column) could be false with the increase of newly cultured organisms; does not infer upper rank when taxonomic resolution is poor |
| BacDive | Python and R API, R package | | Morphology, physiology (API®-tests), molecular data, and cultivation conditions | Database | Not applicable | NCBI taxonomy | Provides links to ENA, GenBank, SILVA, BRENDA, GBIF, ChEBI, Straininfo website data; a match with 16S rRNA sequences is available from SILVA | Does not provide a tool for metabarcoding output |
| BugBase | R/Python | 16S rRNA | KEGG | Functional inference | PICRUSt, custom trait assignment | Greengenes | Biologically interpretable traits (Gram staining, oxygen tolerance, biofilm formation, pathogenicity, mobile element content, and oxidative stress tolerance) | No peer-reviewed publication (bioRxiv preprint) |
| IJSEM | Flat file with R script for curation | | IJSEM | Database | Not applicable | Not applicable | 16S rRNA accession numbers available | Does not provide a tool for metabarcoding output |
| ProTraits | Web-based; flat files | | Wikipedia, MicrobeWiki, HAMAP proteomes, PubMed abstracts and publications, Bacmap, Genoscope, JGI, KEGG, NCBI, Karyn's Genomes | Database | Not applicable | Not applicable | Large phenotypic inference resource (∼545,000 phenotypes scanning 424 traits across 3,046 species); NCBI taxonomy available | Does not provide a tool for metabarcoding output |
| BURRITO | Web-based | 16S rRNA | KO | Functional inference | PICRUSt | Greengenes | Explores simultaneous and integrative studies of taxonomic and functional profiles | Based on PICRUSt v1 |
| MACADAM | Python/web implementation | 16S rRNA | MetaCyc, MicroCyc, FAPROTAX, IJSEM | Functional inference, Trait assignment | Custom methods (provides functional information about upper-rank taxa when organism name is not found) | NCBI taxonomy | Pathway score and pathway frequency score are provided, allowing knowledge of the number of enzymes present in the pathway | Not yet available for microbial eukaryotes |
| FunFun | R package; flat file | | Ecological traits | Trait assignment | Not applicable | Based on UNITE taxonomy (ITS) | Uses R (multiplatform); complementary to FUNGuild | |
| FungalTraits | Flat files | | Guild type, body type, habitat | Trait assignment | Not applicable | Based on UNITE taxonomy (ITS) | Expert work to propose traits at the genus level; merges the FUNGuild and FunFun tools; an Excel file with a VLOOKUP function is available to assign guilds or trait data | Does not provide a tool for metabarcoding output |
| DEEMY | Web-based | | Morphology, anatomy, potential for chemical reactions, or even ecology traits | Database | Not applicable | Not applicable | Links to associated tree species; includes images | Specialized in ectomycorrhizas only |
| Bacteria-archaea-traits | R package; flat file | 16S rRNA | Traits, phenotypic traits, quantitative genomic traits | Database | Not applicable | NCBI taxonomy, GTDB taxonomy | Groups the major bacterial and archaeal databases into 1 database; traits and species data condensed; R workflow available to retrieve condensed trait and species data | |
| OntoBiotope | Web-based | | Habitats and phenotypes | Database | ToMap (Text to ontology mapping) | NCBI taxonomy | Term relevance is evaluated by the semantic search engine PubMedBiotope; maintained by ∼30 microbiology experts | Dedicated to the food domain |
| @Minter | Python | | Microbial interactions | Machine learning | Support-vector machine (SVM)-based classifier | No specific taxonomy, just species level | Original approach to get information on microbial interactions rapidly | Species name required |
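
The genus-level rule described for FAPROTAX in the Methods column can be illustrated with a short Python sketch. This is not the FAPROTAX code: the strain records, the OTU identifiers, and the helper names `genus_level_functions` and `assign_traits` are hypothetical, and the sketch only assumes that a function is propagated to a genus when every cultured member on record shares it.

```python
# Hypothetical records of cultured strains: (genus, documented functions).
cultured = [
    ("Nitrosomonas", {"nitrification"}),
    ("Nitrosomonas", {"nitrification", "ureolysis"}),
    ("Pseudomonas", {"denitrification"}),
    ("Pseudomonas", {"fermentation"}),
]


def genus_level_functions(records):
    """Keep, for each genus, only the functions shared by all cultured members."""
    shared = {}
    for genus, functions in records:
        if genus in shared:
            shared[genus] &= functions  # intersect with earlier strains
        else:
            shared[genus] = set(functions)  # copy so the input is not mutated
    return shared


def assign_traits(otu_genera, genus_functions):
    """Map each OTU to its genus-level functions; unknown genera get none."""
    return {
        otu: sorted(genus_functions.get(genus, set()))
        for otu, genus in otu_genera.items()
    }


# Hypothetical taxonomic assignments from a metabarcoding run.
otu_genera = {
    "OTU_1": "Nitrosomonas",
    "OTU_2": "Pseudomonas",
    "OTU_3": "Unclassified",
}

genus_functions = genus_level_functions(cultured)
assignments = assign_traits(otu_genera, genus_functions)
print(assignments)
```

In this toy case the OTU without a genus-level assignment receives no function, which mirrors the limitation listed in the table: no inference is made at an upper rank when taxonomic resolution is poor.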
Figure 7: Annual cumulative number of citations of the major tools (A) and their scope (B). The keywords used for “scope” were retrieved from the titles and abstracts of the articles listed in Additional File 1.
Figure 8: Overview of the quality of functional prediction based on a subsampling of articles for PICRUSt (A) and Tax4Fun (B) across various ecosystems. For PICRUSt, colors were assigned according to NSTI results: <0.06, quite good; 0.06–0.10, good; 0.10–0.15, reasonable but probably approximate; and >0.20, probably unreliable. For Tax4Fun, we split the fraction of OTUs that could not be mapped to KEGG organisms into 5 harmonious groups. References are listed in Additional File 1. The distribution of the data is displayed as boxplots, a standardized representation based on a five-number summary (minimum, first quartile, median, third quartile, and maximum), with outliers shown as black circles.
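
The NSTI categories and the five-number summary mentioned in the Figure 8 caption can be sketched in Python as follows. The NSTI values are made up for illustration, the function names are hypothetical, and two choices are assumptions rather than statements from the caption: a value of exactly 0.10 is placed in the "good" class, and values between 0.15 and 0.20, which the caption does not label, are reported as uncovered.

```python
import statistics


def classify_nsti(nsti):
    """Map an NSTI value to the quality labels listed in the caption."""
    if nsti < 0.06:
        return "quite good"
    if nsti <= 0.10:  # assumption: the shared boundary 0.10 counts as "good"
        return "good"
    if nsti <= 0.15:
        return "reasonable but probably approximate"
    if nsti > 0.20:
        return "probably unreliable"
    # The caption gives no label for values above 0.15 and up to 0.20.
    return "not covered by the caption"


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical NSTI values collected from a set of studies.
nsti_values = [0.02, 0.05, 0.08, 0.12, 0.25]

labels = [classify_nsti(v) for v in nsti_values]
summary = five_number_summary(nsti_values)
print(labels)
print(summary)
```

Quartile conventions differ between software packages, so the `inclusive` method used here is only one of several reasonable choices and may not match the one used to draw the boxplots.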
Figure 9: Summary diagram of the most relevant microbial soil functions results based on functional inference and ecological trait assignment.
The figure is made up of 2 parts: studies on bacterial communities based on functional inference on the left and studies on fungal communities based on ecological trait assignment on the right. For all studies (climate change, anthropogenic gradient, agricultural practices, plant diversity, or the biogeochemical cycle), if an effect or a correlation was found on the gene reservoir or on microbial communities with a particular ecological trait, a colored arrow indicates the effect and a cross indicates no significant effect. A triangle indicates either a decrease or an increase of the gene reservoir or microbial communities with a particular trait. References are listed in Additional File 1.
Figure 10: Summary diagram of the expected results (first box), the functional prediction prospects (second box), and the limits of the microbial genomic data available for different habitats (third box). The first box shows a comparative example of community structures and functional structures through a PCA (A). This example illustrates the case in which the functional structure differentiates experimental conditions better than the microbial community structure does. Illustrative heat maps show the relative abundance of genes per sample (B) or per OTU (C).