Serina L Robinson, Jörn Piel, Shinichi Sunagawa.
Abstract
Covering: up to 2021

Metagenomics has yielded massive amounts of sequencing data offering a glimpse into the biosynthetic potential of the uncultivated microbial majority. While genome-resolved information about microbial communities from nearly every environment on earth is now available, the ability to accurately predict biocatalytic functions directly from sequencing data remains challenging. Compared to primary metabolic pathways, enzymes involved in secondary metabolism often catalyze specialized reactions with diverse substrates, making these pathways rich resources for the discovery of new enzymology. To date, functional insights gained from studies on environmental DNA (eDNA) have largely relied on PCR- or activity-based screening of eDNA fragments cloned in fosmid or cosmid libraries. As an alternative, shotgun metagenomics holds underexplored potential for the discovery of new enzymes directly from eDNA by avoiding common biases introduced through PCR- or activity-guided functional metagenomics workflows. However, inferring new enzyme functions directly from eDNA is similar to searching for a 'needle in a haystack' without direct links between genotype and phenotype. The goal of this review is to provide a roadmap to navigate shotgun metagenomic sequencing data and identify new candidate biosynthetic enzymes. We cover both computational and experimental strategies to mine metagenomes and explore protein sequence space with a spotlight on natural product biosynthesis. Specifically, we compare in silico methods for enzyme discovery including phylogenetics, sequence similarity networks, genomic context, 3D structure-based approaches, and machine learning techniques. We also discuss various experimental strategies to test computational predictions including heterologous expression and screening.
Finally, we provide an outlook for future directions in the field with an emphasis on meta-omics, single-cell genomics, cell-free expression systems, and sequence-independent methods.
Year: 2021 PMID: 34821235 PMCID: PMC8597712 DOI: 10.1039/d1np00006c
Source DB: PubMed Journal: Nat Prod Rep ISSN: 0265-0568 Impact factor: 13.423
Fig. 1 Tiered definitions of enzyme discovery. The hierarchical structure is not meant to reflect superiority of the higher tiers; rather, it refers to the relative number of metagenomic enzyme studies falling within each category.
Comparison of shotgun metagenomic sequencing with activity-guided and PCR-based functional metagenomics
| Methods of enzyme discovery | Shotgun metagenomic sequencing | Activity-guided screening | PCR-based screening |
|---|---|---|---|
| Pros | • Complete functional profile of an environment | • Can lead to detection of new enzymes or folds catalyzing known reactions | • Sensitive for low-abundance sequences |
| | • Genomic context and taxonomy obtained through binning/assembly | • Well-developed methods to screen for industrially-relevant enzymes, | • Detect variation within a single gene family at the level of single nucleotide changes |
| | • Higher accuracy achievable with proximity-guided assembly and long-read sequencing methods | • Inexpensive | • Relatively inexpensive |
| | • Can be combined with other meta-omics analyses | • Activity-forward method guarantees enzymes are active and express well in | |
| | • Generally less biased than activity- and PCR-based methods | | |
| Cons | • High sequencing depth required to detect genes in low abundance | • Limited to genes and small to medium-sized gene clusters that are expressed in the screening host | • Requires conserved DNA motifs in target sequences |
| | • Computationally intensive assembly and binning | • Typically limited to types of reactions that can be screened rapidly | • Not effective for detecting novel enzyme sequences or folds |
| | • Challenging to infer function from sequence alone | • Can require specific high-throughput screening equipment | • Little to no taxonomic information |
| | | • No taxonomic information | • PCR bias against GC-rich sequences |
| | | • Can only screen for one type of reaction/function at a time | • Short reads make gene cluster context difficult to recover |
Fig. 2 Heatmap of PFAM domains extracted from the MIBiG database[35] cross-referenced with predicted EC reactions for each PFAM domain using ECDomainMiner.[68] Color intensity corresponds to the number of distinct predicted reactions (at the level of two EC class digits) associated with each PFAM domain. Y-axis heatmap labels include standard PFAM domain abbreviations, PFAM family IDs, and the number of occurrences of each PFAM domain in MIBiG BGCs in parentheses. X-axis heatmap labels refer to the standard top-level EC number codes (excluding EC7 translocases, which were not included in this analysis).
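
The counting behind such a heatmap can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `ec_heatmap_counts`, the input format of (PFAM ID, EC number) pairs, and the toy identifiers `PFAM_A` and `PFAM_B` are all hypothetical, and the input does not reproduce the actual ECDomainMiner output format. For each PFAM domain and each top-level EC class (EC1 to EC6), the sketch counts how many distinct two-digit EC classes are associated with the domain.

```python
from collections import defaultdict


def ec_heatmap_counts(associations):
    """Count distinct two-digit EC classes per (PFAM domain, top-level EC class).

    `associations` is a hypothetical iterable of (pfam_id, ec_number) pairs,
    e.g. ("PFAM_A", "1.14.13.1"); it is not the ECDomainMiner output format.
    """
    seen = defaultdict(set)
    for pfam_id, ec_number in associations:
        parts = ec_number.split(".")
        if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip malformed or partial EC numbers such as "1.-.-.-"
        top_level = int(parts[0])
        if top_level not in range(1, 7):
            continue  # EC7 translocases are excluded, as in the caption
        seen[(pfam_id, top_level)].add((top_level, int(parts[1])))

    # One row per PFAM domain, one column per top-level EC class (1..6).
    matrix = {}
    for (pfam_id, top_level), subclasses in seen.items():
        row = matrix.setdefault(pfam_id, {ec: 0 for ec in range(1, 7)})
        row[top_level] = len(subclasses)
    return matrix


# Toy, made-up associations for illustration only.
toy_associations = [
    ("PFAM_A", "1.14.13.1"),
    ("PFAM_A", "1.14.14.1"),  # same two-digit class (1.14) as above
    ("PFAM_A", "1.11.1.7"),
    ("PFAM_A", "4.2.1.1"),
    ("PFAM_B", "2.3.1.41"),
    ("PFAM_B", "7.1.1.1"),    # EC7, ignored
    ("PFAM_B", "1.-.-.-"),    # partial EC number, ignored
]

counts = ec_heatmap_counts(toy_associations)
for pfam_id in sorted(counts):
    print(pfam_id, [counts[pfam_id][ec] for ec in range(1, 7)])
# PFAM_A [2, 0, 0, 1, 0, 0]
# PFAM_B [0, 1, 0, 0, 0, 0]
```

Each printed row corresponds to one y-axis label and each list position to one x-axis EC class; the integers are what the color intensity would encode.
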
Fig. 4 Selected enzymes highlighted in this review. (A) IkaB oxidoreductase involved in ikarugamycin polycyclization. (B) ThiF-nitroreductase di-domain enzyme, OxzB, catalyzes cyclization of oxazolone-containing metabolites with homologs detected in metagenomes from various environments (mainly marine). (C) PdxI catalyzes an Alder–ene reaction to form a vinyl cyclohexane intermediate in biosynthetic pathways for fungal alkaloids including pyridoxatin and cordypyridones. (D) Arginase-family enzyme, OspR, promiscuously installs ornithines in the backbones of peptide natural products. OspR homologs were characterized from various microbial isolates and from the uncultivated phylum ‘Candidatus Wallbacteria’ from groundwater metagenomes. (E) FrsA thioesterase domain originally detected in an uncultivated leaf symbiont catalyzes intramolecular thioesterification of the Gq protein inhibitor FR900359.
Fig. 3 Flowchart of strategies for in silico selection and experimental characterization of candidate metagenomic enzymes.
Selected pros and cons of different computational methods for enzyme discovery covered in this review
| | Phylogenetics | Sequence similarity networking | Genome neighborhoods and protein interaction networks | 3D-structural methods, motifs, and active site residues | Machine learning |
|---|---|---|---|---|---|
| Pros | • Longstanding, well-established methods to investigate functional relationships between proteins | • Intuitive graphical representation of thousands of protein sequences simultaneously | • Guilt-by-association methods can reveal new functional relationships for proteins independent of primary sequence | • Variations in active site architecture can have large consequences for biocatalysis → handles for discovery | • Deep learning, transfer learning, and autoencoding methods useful to learn complex or hidden relationships for functional inference |
| | • Insights into evolution of protein families, | • Allows users to quickly identify clusters without known representatives in sequence space | • Unusual co-occurring domains or interacting proteins are new targets for enzyme discovery | • Structural motifs are useful for searches independent of full-length primary sequence | • Capable of recognizing patterns in big metagenomic datasets |
| Cons | • Heavily influenced by the quality of the underlying sequence alignment | • Pruning of SSNs by BLAST e-value can be subjective | • Analysis of gene neighborhoods from metagenomes requires assembly → introduces errors, and it is not always possible to recover flanking genes for low-abundance organisms | • Similar structural folds catalyze a wide range of different reactions | • Requires a large quantity of ‘labeled’ |
| | • Not all biosynthetic domains have a consistent or strong phylogenetic signal | • Unclear how to handle or gain functional insights from ‘singletons’ | | • Relatively few structures solved from metagenomic sources | • Classification systems limited in their ability to predict entirely new enzyme functions |
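
Two of the sequence similarity network (SSN) entries in the table, the subjectivity of e-value pruning and the question of how to treat singletons, can be illustrated with a small sketch. It assumes pairwise hits are already available as (query, subject, e-value) tuples, for example parsed from an all-vs-all BLAST search; the function name `build_ssn_clusters`, the toy sequence IDs, and the e-values are hypothetical and chosen only for illustration. Clusters are computed as connected components using a simple union-find structure rather than any particular SSN tool.

```python
def build_ssn_clusters(sequence_ids, hits, evalue_cutoff=1e-30):
    """Group sequences into SSN clusters (connected components).

    `hits` is a hypothetical list of (query, subject, evalue) tuples.
    An edge is kept only if its e-value is at or below `evalue_cutoff`.
    """
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(node):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for query, subject, evalue in hits:
        if query == subject or evalue > evalue_cutoff:
            continue  # drop self-hits and edges pruned by the cutoff
        if query not in parent or subject not in parent:
            continue  # ignore hits to sequences outside the network
        root_q, root_s = find(query), find(subject)
        if root_q != root_s:
            parent[root_q] = root_s

    clusters = {}
    for seq_id in sequence_ids:
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    # Largest clusters first; ties broken by member names for determinism.
    return sorted(clusters.values(), key=lambda c: (-len(c), sorted(c)))


# Toy data for illustration only.
ids = ["a", "b", "c", "d", "e"]
toy_hits = [("a", "b", 1e-50), ("b", "c", 1e-40),
            ("c", "d", 1e-10), ("d", "e", 1e-35)]
characterized = {"a"}  # sequences with known function (assumed labels)

for cutoff in (1e-30, 1e-45):
    clusters = build_ssn_clusters(ids, toy_hits, cutoff)
    singletons = [c for c in clusters if len(c) == 1]
    unknown = [c for c in clusters if len(c) > 1 and not (c & characterized)]
    print(cutoff, len(clusters), len(singletons), len(unknown))
# 1e-30 2 0 1
# 1e-45 4 3 0
```

With the looser cutoff the toy network contains one multi-member cluster without a characterized representative, whereas the stricter cutoff breaks it into singletons, which is one way the choice of threshold shapes the candidates a user would pursue.
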
Fig. 5 Common steps in a machine learning workflow for protein function prediction covered in this review.
Fig. 6 Trade-off between generalizability and throughput for common enzyme screening approaches.