Genomic Enzymology: Web Tools for Leveraging Protein Family Sequence-Function Space and Genome Context to Discover Novel Functions.

Abstract

The exponentially increasing number of protein and nucleic acid sequences provides opportunities to discover novel enzymes, metabolic pathways, and metabolites/natural products, thereby adding to our knowledge of biochemistry and biology. The challenge has evolved from generating sequence information to mining the databases to integrating and leveraging the available information, i.e., the availability of "genomic enzymology" web tools. Web tools that allow identification of biosynthetic gene clusters are widely used by the natural products/synthetic biology community, thereby facilitating the discovery of novel natural products and the enzymes responsible for their biosynthesis. However, many novel enzymes with interesting mechanisms participate in uncharacterized small-molecule metabolic pathways; their discovery and functional characterization also can be accomplished by leveraging information in protein and nucleic acid databases. This Perspective focuses on two genomic enzymology web tools that assist the discovery of novel metabolic pathways: (1) Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) for generating sequence similarity networks to visualize and analyze sequence-function space in protein families and (2) Enzyme Function Initiative-Genome Neighborhood Tool (EFI-GNT) for generating genome neighborhood networks to visualize and analyze the genome context in microbial and fungal genomes. Both tools have been adapted to other applications to facilitate target selection for enzyme discovery and functional characterization. As the natural products community has demonstrated, the enzymology community needs to embrace the essential role of web tools that allow the protein and genome sequence databases to be leveraged for novel insights into enzymological problems.

Substances：
Enzymes

Year: 2017 PMID： 28826221 PMCID： PMC5569362 DOI： 10.1021/acs.biochem.7b00614

Source DB: PubMed Journal: Biochemistry ISSN： 0006-2960 Impact factor: 3.162

In 2001 Patricia Babbitt and I discussed nature’s strategies for divergent evolution of new enzymatic functions from a common progenitor to yield mechanistically diverse enzyme superfamilies (conserved active site architectures that catalyze reactions with shared partial reactions, intermediates, or transition states) and functionally diverse suprafamilies (conserved active site architectures that catalyze mechanistically distinct reactions).[1] When our review was published, only a few superfamilies/suprafamilies had been recognized, including the enolase, amidohydrolase, thiyl radical, enoyl-CoA hydratase (crotonase), vicinal-oxygen-chelate superfamilies, and the orotidine 5′-monophosphate (OMP) decarboxylase suprafamily, not surprising because the UniProt database then contained only 571 804 protein sequences (July 2001) (http://www.uniprot.org/; see Table 1 for a summary of abbreviations). Despite, in retrospect, a meager number of sequences, we concluded that enzymologists were positioned to expand their interests beyond studies of single enzymes to encompass entire enzyme families. We proposed that sequenced genomes (1) provided a rapidly expanding source of new proteins for investigation and (2) allowed genomic context to be used to infer novel enzymatic functions and, therefore, better understand the evolution of functional diversity in enzyme superfamilies. We suggested the term genomic enzymology to describe the expansive strategy of using protein families and genome context to focus studies of enzyme mechanisms, discover new functions, and more accurately describe the evolution of enzyme function in molecular terms (sequence and structure). However, we did not propose how the protein and genome sequence databases could be leveraged and used by the experimental community.

Table 1

List of Abbreviations

ABC	ATP-binding cassette
AGeNNT	Automatically Generates refined Neighborhood NeTworks
antiSMASH	Antibiotics & Secondary Metabolite Analysis SHell
BGC	biosynthetic gene cluster
BLAST	Basic Local Alignment Search Tool
DSF	differential scanning fluorimetry
DUF	domain of unknown function
EFI	Enzyme Function Initiative
EFI-EST	EFI-Enzyme Similarity Tool
EFI-GNT	EFI-Genome Neighborhood Tool
ENA	European Nucleotide Archive
GNN	Genome Neighborhood Network
GRE	glycyl radical enzyme
InterPro	Integrated Protein Database
JGI-IMG/M	Joint Genome Institute-Integrated Microbial Genomes/Metagenomes
MSA	multiple sequence alignment
NCBI	National Center for Biotechnology Information
NRPS	nonribosomal peptide synthase
OMP	orotidine 5′-monophosphate
orf	open reading frame
P5C	Δ¹-pyrroline-5-carboxylate
Pfam	Protein Family Database
PKS	polyketide synthase
PN	proteome network
PRISM	PRediction Informatics for Secondary Metabolomes
RLP	RuBisCO-like protein
RODEO	rapid ORF description and evaluation online
RuBisCO	ribulose bisphosphate carboxylase/oxygenase
SBP	solute binding protein
SFLD	Structure–Function Linkage Database
ShortBRED	“Short, Better Representative Extract Data Set”
SSN	Sequence Similarity Network
TCT	tricarboxylate transport
TRAP	tripartite ATP-independent periplasmic transporter
TRN	Taxonomic Rank Network
UniProt	Universal Protein Resource
UniProtKB	UniProt Knowledgebase

Sixteen years later, the UniProt database contains 88 588 026 nonredundant sequences (Figure 1; Release 2017_07); the number of sequences is increasing at the rate of 2.4% per month (doubling time 2.5 years), largely the result of microbial genome projects. The challenge is to devise “user friendly” methods to interrogate the massive amount of data so that hypotheses can be generated that direct experimental determination of in vitro activities and in vivo metabolic functions of uncharacterized enzymes. For example, 379 mechanistically diverse superfamilies and functionally diverse suprafamilies have been described;[2] additional superfamilies and suprafamilies must be present in (1) genomic “dark matter” that has not been curated by databases such as Pfam and (2) the genomes of phylogenetically diverse bacterial species that have not yet been systematically sequenced.[3] This large, and growing for the foreseeable future, set of superfamilies includes members that catalyze novel reactions in novel pathways, a boon to enzymologists.
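The relationship between the quoted monthly growth rate and the doubling time can be checked with a short calculation. The sketch below assumes constant compound growth; the function name is illustrative and not part of any UniProt tool. Under that assumption a 2.4% monthly increase corresponds to a doubling time of roughly 29 months, in line with the approximate figure quoted in the text.

```python
import math


def doubling_time_months(monthly_growth_rate: float) -> float:
    """Months needed to double at a constant compound monthly growth rate."""
    return math.log(2) / math.log(1 + monthly_growth_rate)


# 2.4% per month is the growth rate quoted for UniProt Release 2017_07.
months = doubling_time_months(0.024)
years = months / 12
print(f"doubling time: {months:.1f} months ({years:.1f} years)")
```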

Figure 1

Growth of the UniProt protein sequence database (Release 2017_07). The blue line represents the EMBL/TrEMBL sequences with automated annotations; the red line represents the EMBL/SwissProt with manually curated annotations. Currently, the doubling time is ∼2.5 years. The number of sequences decreased by ∼50% in April 2015 when UniProt identified reference proteomes for closely related species and archived the redundant proteomes.

Approximately 50% of the proteins in the databases have incorrect, uncertain, or unknown functional annotations.[4] The UniProt Knowledgebase (UniProtKB) is composed of two sections, UniProtKB/SwissProt and UniProtKB/TrEMBL. The annotations in UniProtKB/SwissProt are manually curated; the functional annotations in UniProtKB/TrEMBL are computationally assigned based on the function of the “closest” homologue. In the most recent UniProt release (2017_07), only 0.63% of the sequences are in the UniProtKB/SwissProt section (Figure 1); this fraction continues to decrease because the total number of sequences added in each release greatly exceeds the number of new sequences with SwissProt-curated, experimentally verified annotations. In principle, curated annotations might be extended to orthologues; however, the sequence boundaries between functions are unknown, so homology-based approaches for functional assignment are risky. Therefore, incorrect, uncertain, or unknown annotations will continue to propagate, compromising their utility to allow the discovery of new enzymatic functions, metabolic pathways, metabolites, and biology. Khosla recently summarized this challenge:[5] “Although enzymology will remain a predominantly experimental science for the foreseeable future, one cannot avoid a sense of helplessness when one considers the huge (and growing) deficit in functionally annotated sequences. By now, there are approximately 100 million nonredundant protein sequence entries in GenBank, but a reliably curated protein database such as SwissProt contains fewer than 1 million entries. This is a quintessential ‘big data’ problem, where the rate at which data is generated continues to outpace the rate at which it is curated. It is unlikely that more resource-intensive curation alone can solve the problem. As the proverb says, this may be a situation where the most desirable approach will involve user-friendly tools that teach a novice how to fish instead of serving fish. Such tools could ideally capture the essence of an enzymologist’s judgment in layers of increasing sophistication, depending on the user’s actual needs.” This Perspective describes “genomic enzymology” web tools that initially were developed by the Enzyme Function Initiative (EFI)[6] and provides examples of their applications.

Web Tools for Natural Product Discovery

In parallel with the development of genomic enzymology, the natural products community discovered that genes encoding biosynthetic pathways for natural products often are organized in “biosynthetic gene clusters” (BGCs).[7−9] Given the structural complexity of natural products and the need to identify the enzymes that assemble their backbones, e.g., terpene synthases, nonribosomal peptide synthases (NRPSs), and polyketide synthases (PKSs), as well as the enzymes that catalyze “tailoring” reactions, e.g., glycosylases, methylases, and redox enzymes, the genomic colocalization of the biosynthetic genes facilitates pathway discovery and experimental characterization. Although the type of scaffold may be apparent from the annotations in the BGCs, the structure of the natural product is not trivial to predict. Indeed, many enzymes (backbone-forming and tailoring) are novel members of diverse enzyme superfamilies. Nonetheless, the discovery of a BGC facilitates enzyme identification so that they can be experimentally tested for sequential activities in the biosynthetic pathway. The number of natural products is estimated to be extremely large;[10,11] therefore, identification of BGCs is an attractive strategy for their discovery. In the past several years, bioinformatic tools have been developed for discovering BGCs in sequenced genomes,[12,13] including antiSMASH (Antibiotics & Secondary Metabolite Analysis SHell[14]), PRISM (PRediction Informatics for Secondary Metabolomes[15]), and RODEO (Rapid ORF Description and Evaluation Online[16]). These tools are widely used by the natural products/synthetic biology community, e.g., more than 300 000 jobs have been processed by the antiSMASH server (https://antismash.secondarymetabolites.org/). 
Although these tools enable the discovery of BGCs, the annotations of the uncharacterized enzymes in the BGCs are limited to their membership in protein families, an overview that often is insufficient to restrict substrate specificities and/or reaction identities/mechanisms. Therefore, many of the challenges in BGC characterization are the same as those encountered by enzymologists focused on small-molecule metabolic pathways (vide infra).

What Should Genomic Enzymology Tools Provide?

Genomic enzymology focuses on the discovery of function in the context of entire enzyme families: this approach allows recognition of sequence and structure attributes that are conserved for specific functions. Babbitt developed the Structure–Function Linkage Database (SFLD; http://sfld.rbvi.ucsf.edu/) to generate and disseminate sequence–structure relationships that associate specific functional properties with specific sequence and structure motifs in functionally diverse enzyme superfamilies.[17] As an early example of the use of genomic enzymology to obtain mechanistic insights, the recognition that (1) the reactions catalyzed by mandelate racemase and muconate lactonizing enzyme in the enolase superfamily require stabilization of an enolate anion intermediate and (2) their sequences have conserved motifs for binding an active site Mg²⁺ defined the catalytic strategy for the superfamily.[1,18,19] The functional diversity in the superfamily, including dehydration, deamination, cycloisomerization, racemization, and epimerization of carboxylate-anion substrates, could be explained by divergent evolution selecting (1) acid/base catalysts for both generating the enolate anion intermediate and directing it to products and (2) specificity determinants for binding different substrates in productive geometries relative to the acid/base catalysts.[20,21] This same strategy for evolution of new enzymatic functions applies to many mechanistically diverse superfamilies.[2] The challenges for genomic enzymology are developing and applying large-scale methods for (1) grouping members of mechanistically diverse superfamilies and functionally diverse suprafamilies in isofunctional families, e.g., identifying acid/base catalysts and placing restrictions on reaction mechanisms and substrate specificities, and (2) analyzing the genome contexts for the members of isofunctional families so that their roles in metabolic pathways can be deduced, e.g., predicting substrates, intermediates, and products.

Sequence Similarity Networks (SSNs)

Evolutionary biologists typically use phylogenetics-based approaches to distinguish orthologues from paralogues.[22,23] Phylogenetic trees are constructed from multiple sequence alignments (MSAs); however, MSAs are difficult to generate for large protein families.[23] Many superfamilies and suprafamilies are large: >15 K sequences in the glycyl-radical enzyme superfamily, >22 K sequences in the OMP decarboxylase suprafamily, >44 K sequences in the enolase superfamily, >122 K sequences in the enoyl-CoA hydratase (crotonase) superfamily, and >250 K sequences in the radical SAM superfamily. In addition to being difficult to construct, trees for large families also are difficult to interpret because of their complexity.[24] Trees do not provide immediate access to all sequences in a family—representative sequences usually are selected in the construction of the tree. Instead, what is needed is a large-scale approach that allows easy visualization and analyses for all sequences in a family, recognizing that it must be “user friendly”, i.e., intuitive and fast. Atkinson and Babbitt introduced sequence similarity networks (SSNs) to enable large-scale analyses of sequence–function relationships in protein families.[25] An SSN displays pairwise relationships obtained from an all-by-all sequence comparison, e.g., BLAST. Although the use of BLAST can be criticized because it provides a measure of overall sequence similarity and, therefore, may be insensitive to different domain architectures important in determining molecular function, it is (1) fast, a requirement for routine all-by-all comparisons of the sequences of members of increasingly large protein families (each sequence must be compared with every other sequence so the time required increases with the square of the number of sequences), and (2) familiar to experimentalists. 
An SSN contains “nodes” for sequences; “edges” that quantitate sequence similarity (pairwise sequence identity) connect nodes that share sequence similarity that exceeds a user-specified level (Figure 2). As the sequence similarity required to connect nodes with edges is increased, the nodes segregate into clusters; the goal is to select a level of sequence similarity that segregates the nodes/members of the family into isofunctional clusters (Figure 3).
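The thresholding behavior described above can be illustrated with a minimal sketch. It is not the EFI-EST implementation: the accession IDs and alignment scores are hypothetical, and the functions `build_ssn` and `clusters` are names introduced here for illustration. Edges are kept when the pairwise score meets the threshold, and clusters are the connected components of the resulting network.

```python
from collections import defaultdict


def build_ssn(scores, threshold):
    """Keep only the edges whose pairwise score meets the threshold.

    scores maps an (id_a, id_b) pair to an alignment score.
    """
    return {pair: s for pair, s in scores.items() if s >= threshold}


def clusters(nodes, edges):
    """Return connected components (union-find), largest first."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical all-by-all result for five sequences.
nodes = ["P1", "P2", "P3", "P4", "P5"]
scores = {("P1", "P2"): 120, ("P2", "P3"): 60, ("P1", "P3"): 55,
          ("P3", "P4"): 25, ("P4", "P5"): 115}

low = clusters(nodes, build_ssn(scores, 20))    # one cluster
mid = clusters(nodes, build_ssn(scores, 50))    # two clusters
high = clusters(nodes, build_ssn(scores, 110))  # two pairs and a singleton
```

Raising the threshold from 20 to 110 splits the single cluster progressively, which is the segregation the user inspects when choosing a threshold intended to give isofunctional clusters.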

Figure 2

A sequence similarity network (SSN) showing the protein sequence nodes and pairwise sequence similarity edges.

Figure 3

SSNs for sequences from the proline racemase family (Pfam family PF05544). (A) Alignment score ≥15, ≥22% pairwise sequence identity. (B) Alignment score ≥20, ≥25% pairwise sequence identity. (C) Alignment score ≥50, ≥35% sequence identity. (D) Alignment score ≥70, ≥40% sequence identity. (E) Alignment score ≥90, ≥48% sequence identity. (F) Alignment score ≥110, ≥58% sequence identity. The colors in panel F are used to color the nodes in panels A–E.

SSNs contain “node attributes”, including functional and phylogenetic information associated with each sequence/node, that assist the user in analyzing sequence–function relationships, including choosing sequence similarity thresholds for drawing edges and segregating the families into isofunctional clusters. Atkinson and Babbitt compared SSNs with phylogenetic trees and concluded “the most valuable feature of SSNs is not the optimal or most accurate display of sequence similarity, but rather the flexible visualization of many alternate protein attributes for all or nearly all sequences in a superfamily”.[25] SSNs are viewed using Cytoscape (http://cytoscape.org/), “an open source platform for visualizing complex networks and integrating these with attribute data”.[26] Although Cytoscape has a steep “learning curve”, it provides Control Panels to select nodes based on the node attributes and to filter and color the networks to enable visual analyses. 
With node attributes and the Control Panels, SSNs viewed with Cytoscape satisfy Khosla’s vision that genomic enzymology tools “could ideally capture the essence of an enzymologist’s judgment in layers of increasing sophistication, depending on the user’s actual needs”.[5] The SFLD provides SSNs for several functionally diverse superfamilies with manually curated (labor intensive and expensive) annotations/node attributes;[17] these SSNs serve as “gold standards” for functional annotation in both the bioinformatics and enzymology communities.[27] However, with the large number of superfamilies/suprafamilies (vide infra) and families that provide additional metabolic enzymes, e.g., dehydrogenases, kinases, and aldolases, community-initiated generation of SSNs is necessary. The SFLD does not provide this capability; Pythoscape was developed by the SFLD for generating large SSNs, but it is not “user friendly” for most experimentalists because it requires access to a computer cluster and programming expertise.[28] In principle, the construction of SSNs is “simple”, i.e., connecting sequences with edges that quantitate similarity. However, most experimentalists would be hard-pressed to develop their own programs for generating SSNs. And, other web tools that construct SSNs, e.g., Pclust[29] and CLANS,[30] use a limited number of sequences and/or node attributes. The EFI developed a web tool, the Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST; http://efi.igb.illinois.edu/efi-est/),[31] to generate SSNs for large protein families. 
To date, >1600 unique users have submitted jobs to EFI-EST, and >50 publications have appeared that reference the use of EFI-EST.[13,14,32−78] EFI-EST uses sequences and node attribute information from UniProt: in contrast to the NCBI database, annotations in the UniProt database can be changed with data provided by any member of the community, allowing important corrections and additions that diminish propagation of annotation errors. EFI-EST now provides four options for selecting sequences to be included in the SSN: Option A, a single user-supplied sequence is used to collect homologues with BLAST from the UniProt database (maximum 10 000 sequences); Option B, the user specifies one or more UniProt and/or InterPro families [currently limited to ≤255,000 sequences to allow the SSN for the radical SAM superfamily (Pfam family PF04055) to be generated]; Option C (enhanced in the most recent update), the user provides a FASTA file of sequences and selects whether accession IDs in the headers are used to retrieve node attributes from UniProt; and Option D (new in the most recent update), the user provides a list of UniProt and/or NCBI accession IDs. After the all-by-all comparison using BLAST, the user selects an “alignment score” based on pairwise percent identity to filter the edges (the threshold for drawing edges to connect nodes). The user then downloads the SSN for analysis with Cytoscape. EFI-EST now provides a “Color SSN Utility” to facilitate analyses of SSNs by (1) coloring each cluster in an input SSN with a unique color, (2) providing a file with color information that allows the user to color SSNs of the same sequences generated with lower similarity (pairwise identity) to track segregation of clusters (e.g., Figure 3), and (3) providing FASTA files for the sequences in each cluster to facilitate the generation of MSAs.
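The three outputs attributed to the Color SSN Utility can be pictured with a small sketch. This is an illustration under stated assumptions rather than a description of the utility's code: the function names, the ordering of clusters by decreasing size, the two-color palette, and the short sequences are all hypothetical. The node-to-color map produced for a stringent network can be reused to color a network of the same sequences built at a lower threshold.

```python
def order_clusters(clusters):
    """Order clusters by decreasing size (an assumed numbering convention)."""
    return sorted(clusters, key=lambda c: (-len(c), sorted(c)))


def color_clusters(clusters, palette):
    """Map each node to the color of its cluster, cycling the palette if needed."""
    node_color = {}
    for i, members in enumerate(order_clusters(clusters)):
        for node in members:
            node_color[node] = palette[i % len(palette)]
    return node_color


def fasta_per_cluster(clusters, sequences):
    """Return {cluster_number: FASTA text} for use as MSA input."""
    out = {}
    for number, members in enumerate(order_clusters(clusters), start=1):
        out[number] = "".join(
            f">{acc}\n{sequences[acc]}\n" for acc in sorted(members)
        )
    return out


# Hypothetical clusters from a stringent SSN and hypothetical sequences.
ssn_clusters = [{"P3"}, {"P1", "P2"}]
sequences = {"P1": "MKV", "P2": "MKI", "P3": "MAL"}

colors = color_clusters(ssn_clusters, ["#FF0000", "#0000FF"])
fasta = fasta_per_cluster(ssn_clusters, sequences)
```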

Applications of SSNs

The EFI used SSNs from the SFLD to characterize sequence–function space in targeted functionally diverse superfamilies (amidohydrolase,[79−85] enolase,[19,86−92] glutathione S-transferase,[93] haloalkanoate dehalogenase,[94] and isoprenoid synthase[95,96]) and select targets for functional discovery. Then, when EFI-EST became available, both the EFI and community began to use SSNs to characterize sequence–function space in a wide range of protein families. SSNs generated by the community using EFI-EST[13,14,32−78] have been used to identify and describe potential isofunctional families within enzyme families, e.g., clusters with different (but unknown) substrate specificities, thereby providing an overview of sequence–function space in specificity diverse superfamilies (different substrates but same type of overall reaction) and functionally diverse superfamilies (different substrates and different reaction mechanisms, although a partial reaction may be conserved). SSNs also provide the ability to survey the members of a protein family for different domain architectures that may suggest different functional contexts, i.e., fusion proteins in different pathways. And, the pathway for cluster segregation as sequence similarity increases (Figure 3) may suggest functional linkages between clusters. Several community-generated SSNs from the recent literature that illustrate their use are shown in Figure 4; readers are referred to the publications for detailed descriptions.[13,14,32−78]

Figure 4

Examples of SSNs generated with EFI-EST that were included in recent publications. (A) SSN for isopeptidases involved in lasso peptide synthesis.[43] (B) SSN of precursor peptides for microviridin synthesis.[60] (C) SSN of LanMs in lantibiotic synthesis.[76] (D) SSN for ferredoxins compared with a phylogenetic tree.[40] (E) SSN for IspH in isoprenoid biosynthesis.[56] (F) SSNs for members of the DRE-TIM metallolyase superfamily.[52] Figures reproduced with permission from refs (40), (43), (52), (56), (60), and (76).

Genome Neighborhood Networks (GNNs)

With the potential to segregate protein families into isofunctional clusters using SSNs, the second genomic enzymology challenge is to place these clusters in a functional context, e.g., identify the small-molecule metabolic pathways in which uncharacterized enzymes participate. In eubacteria, archaea, and fungi, the enzymes in a metabolic pathway often are encoded by a gene cluster or operon (just as the biosynthetic pathways for natural products are encoded by BGCs). Therefore, the proteins encoded by the genes proximal to those that encode members of an isofunctional cluster (orthologues) may allow the number and types of reactions in the metabolic pathway to be determined if these are conserved by the members of the cluster. Genome neighborhoods for homologues can be examined using web resources such as JGI-IMG/M (https://img.jgi.doe.gov/cgi-bin/m/main.cgi); however, complete pathways are not always encoded by a single genome neighborhood. Large-scale mining of genome neighborhoods for all orthologues in an SSN cluster has the advantage that operon/gene cluster organization may not be preserved across phylogenetic species; i.e., the sequences in an isofunctional SSN cluster may have diverse genome neighborhoods and pathway neighbors, but the ability to survey all of the neighborhoods provides the potential to identify all of the functionally linked genes/enzymes that can be assembled into a metabolic pathway. In 2014, the EFI described a genome neighborhood analysis that was applied to the proline racemase family (Pfam family PF05544) using an all-by-all comparison (with BLAST) of the neighbors to generate a network (the genome neighborhood network, GNN);[97] the neighbors were segregated into protein families using an e-value >20 for the edges in the SSN. By assigning unique colors to the clusters in the SSN (Figure 5A) and coloring the neighbors in the GNN with the same color, the neighbors for the sequences in each cluster were identified (Figure 5B). 
Then, candidates for functionally linked enzymes were recognized and potential pathways were predicted. This analysis allowed in vitro enzymatic activities and in vivo metabolic functions (the three pathways shown in Figure 5C) to be assigned to 85% of the sequences in the family [2333 sequences in InterPro Release 43.0 (July 2013)].

Figure 5

(A) A colored SSN for the proline racemase family (PF05544; InterPro Release 43.0). (B) The GNN generated by an all-by-all BLAST of the genome neighbors. (C) Three pathways catalyzed by members of the proline racemase family. The nodes in the GNN (panel B) are colored using the color clusters in the SSN (Panel A). Figures reproduced with permission from ref (97).

The EFI subsequently developed the Enzyme Function Initiative-Genome Neighborhood Tool (EFI-GNT; http://efi.igb.illinois.edu/efi-gnt/) to provide a “user friendly” interface for generating GNNs to facilitate the identification of pathway/metabolic context for isofunctional clusters in SSNs. Although EFI-GNT has not yet been “officially” announced with a detailed publication (a manuscript describing the updated version of EFI-EST and EFI-GNT is in preparation for publication later this year), >250 unique users have accessed the web tool that is available for community use. An SSN generated by EFI-EST is the input for EFI-GNT [Figure 6A; 6419 sequences in the proline racemase family in InterPro Release 63.0 (May 2017)]. EFI-GNT assigns a unique color (from a palette of 1513 colors) to each cluster (Figure 6B). It then interrogates the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) database for the neighbors of each sequence in each cluster in the input SSN (for eubacteria, archaea, and fungi), and the neighbors are associated with their Pfam families. The co-occurrence frequencies of the queries in the SSN cluster with the neighbors as well as the absolute values of the distances in open reading frames (orfs) between the queries and neighbors are calculated. Functionally linked genes encoding a pathway are expected to have (1) large query-neighbor co-occurrence frequencies (diminished if operon/gene cluster organization is phylogenetically diverse) and (2) short distances between the queries and neighbors.
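The two quantities described above, the query-neighbor co-occurrence frequency and the absolute distance in orfs, can be sketched for one SSN cluster as follows. This is a minimal illustration and not the EFI-GNT code: the function name `gnn_stats`, the input layout, and the toy neighborhoods are assumptions, as is the choice to count only the nearest member of a Pfam family when a query has several neighbors from that family.

```python
from collections import defaultdict
from statistics import median


def gnn_stats(neighborhoods, window=10):
    """Summarize the genome neighbors of the queries in one SSN cluster.

    neighborhoods maps a query ID to a list of (pfam_id, signed_orf_offset).
    Returns {pfam_id: (co-occurrence frequency, median absolute distance)}.
    """
    distances = defaultdict(list)
    for neighbors in neighborhoods.values():
        nearest = {}
        for pfam, offset in neighbors:
            d = abs(offset)
            if 0 < d <= window:  # ignore genes outside the +/- window
                nearest[pfam] = min(d, nearest.get(pfam, d))
        for pfam, d in nearest.items():
            distances[pfam].append(d)
    n_queries = len(neighborhoods)
    return {pfam: (len(ds) / n_queries, median(ds))
            for pfam, ds in distances.items()}


# Hypothetical neighborhoods for four queries in one cluster.
neighborhoods = {
    "Q1": [("PF01266", -1), ("PF00701", 2), ("PF00171", 3)],
    "Q2": [("PF01266", 1), ("PF00701", -2)],
    "Q3": [("PF01266", -1), ("PF00171", 12)],  # PF00171 lies outside the window
    "Q4": [("PF00701", 4)],
}
stats = gnn_stats(neighborhoods, window=10)
```

A family that is close to most queries scores a high frequency and a small median distance, the two signatures of functional linkage named in the text.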

Figure 6

(A) SSN for the proline racemase family (PF05544, InterPro Release 63.0) segregated with an alignment score of ≥110 (≥58% pairwise sequence identity). (B) Colored SSN generated by the EFI-GNT web tool. (C, D) GNN with SSN cluster hub-nodes and Pfam family spoke-nodes. (E, F) GNN with Pfam family hub-nodes and SSN cluster spoke-nodes. The GNNs were generated with a ±10 orf genome neighborhood window and a query-neighbor co-occurrence threshold of 20%.

EFI-GNT provides GNNs in two formats. In one format (Figure 6C,D), a cluster is present for each SSN cluster: the hub-node represents the sequences in the SSN cluster (colored with a unique color so that it can be easily identified in a colored version of the input SSN that is generated), and the spoke-nodes represent the neighbor Pfam families; this format allows the user to identify the pathway enzymes. In the second format, a cluster is present for each neighbor Pfam family: the hub-node represents the Pfam family, and the spoke nodes represent the SSN clusters that identified the neighbors (Figure 6E,F); this format allows the user to assess whether the similarity (edge) threshold used to generate the input SSN was too large (pairwise identity too large) so that orthologues are segregated in multiple clusters, with these identifying the same Pfam family neighbors and pathway. In both GNN formats, the co-occurrence frequencies of the SSN queries and neighbors are the values of the edges between the hub- and spoke-nodes: if the co-occurrence frequency exceeds a user-specified threshold, the edge and spoke-node are present. From the co-occurrence frequencies, the user can identify neighbors that “always” occur with the query (the same conserved operon/gene cluster) as well as those that are less frequently associated (operon/gene cluster in some species; dispersed genes in other species). 
EFI-GNT also provides files with the UniProt IDs for the sequences in each neighbor Pfam family that can be used to identify the neighbors in the SSNs for their families. This mapping (1) assists the selection of alignment score thresholds for segregating the neighbor SSNs into isofunctional clusters/families and (2) provides useful context about possible functional (substrate specificity and reaction mechanism) relationships that may be useful in deducing in vitro activities and in vivo metabolic functions.
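The two hub-and-spoke GNN formats described above are two views of the same table of co-occurrence frequencies. The sketch below is an assumed data layout, not EFI-GNT output: `hub_spoke` is a hypothetical helper, the values for cluster 17 are invented for illustration, and the threshold is treated as a minimum (`>=`).

```python
from collections import defaultdict


def hub_spoke(cooccurrence, threshold, hub="cluster"):
    """Build a hub -> {spoke: frequency} view of a co-occurrence table.

    cooccurrence maps (ssn_cluster, pfam_id) to a co-occurrence frequency.
    hub="cluster" gives SSN cluster hubs with Pfam spokes;
    hub="pfam" gives Pfam hubs with SSN cluster spokes.
    """
    gnn = defaultdict(dict)
    for (cluster, pfam), freq in cooccurrence.items():
        if freq >= threshold:  # edge and spoke-node are drawn
            if hub == "cluster":
                gnn[cluster][pfam] = freq
            else:
                gnn[pfam][cluster] = freq
    return dict(gnn)


cooccurrence = {
    (16, "PF01266"): 0.91, (16, "PF00701"): 0.82,  # values quoted for cluster 16
    (17, "PF01266"): 0.88, (17, "PF13458"): 0.16,  # hypothetical cluster 17
}
by_cluster = hub_spoke(cooccurrence, 0.20, hub="cluster")
by_pfam = hub_spoke(cooccurrence, 0.20, hub="pfam")
```

In the Pfam-hub view, a family with spokes to several SSN clusters (here PF01266 with clusters 16 and 17) is the pattern that would prompt a check for an overly stringent SSN threshold.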

Integrated Use of SSNs and GNNs To Discover Metabolic Pathways

The synergistic “power” of the EFI-EST and EFI-GNT web tools for functional annotation of bacterial and fungal enzymes is the ability to (1) segregate protein families into isofunctional clusters in an SSN using EFI-EST (the sequences in a cluster have the same genome context) and (2) use the SSN as the input for EFI-GNT to interrogate and visualize genome neighborhood context for the isofunctional clusters in the GNN. To the best of our knowledge, no other web tools provide this integrated capability. The GNN format in which the hub-node represents the SSN cluster and the spoke-nodes represent the Pfam families (Figure 6C,D) can be used to identify the enzymes, transcriptional regulators, and transporters in a metabolic pathway. For example, continuing with the proline racemase family (PF05544; SSN in Figure 6A,B), the enzymes in a catabolic pathway for the conversion of trans-4-hydroxyproline to α-ketoglutarate (middle pathway in Figure 5C) can be identified for cluster 16 in the input SSN (Figure 6D, 792 sequences with genome neighborhoods in the ENA files). In addition to 4-hydroxyproline epimerase (the queries in cluster 16 and the SSN hub-node in the GNN cluster in Figure 6D), the Pfam family spoke-nodes of the GNN cluster identify the three remaining enzymes in the pathway: (1) cis-4-hydroxyproline oxidase, a member of the d-amino acid oxidase family (“DAO” in Figure 6D; PF01266, co-occurrence frequency, 0.91, median distance 1.0 orfs); (2) cis-4-hydroxyproline imino acid dehydratase/deaminase, a member of the dihydrodipicolinate synthase family (“DHDPS”; PF00701, co-occurrence frequency, 0.82, median distance 2.0 orfs); and (3) α-ketoglutarate semialdehyde dehydrogenase, a member of the aldehyde dehydrogenase family (“Aldedh”; PF00171, co-occurrence frequency, 0.66, median distance 2.0 orfs). 
The Pfam curations provide essential clues for deducing the identities of the reactions catalyzed by the various neighboring enzymes (conserved reaction mechanisms). The GNN in Figure D also includes (1) the ATP-binding component of an ABC transport system (“ABC_trans”, PF00005, co-occurrence frequency, 0.35, median distance 4.0 orfs), (2) an additional membrane component of the ABC transport system (“BPD_transp_1”, PF00528, co-occurrence frequency, 0.31, median distance 3.0 orfs), and (3) a bidomain transcriptional regulator (“GntR-FCD”, PF00392 and PF07729, co-occurrence frequency, 0.67, median distance 3.0 orfs). The GNN analysis also recognizes genome neighbors that are not associated with any Pfam family (“none” in Figure D; ∼15% of the proteins in UniProt are not associated with a Pfam family). These sequences can contain protein families currently not curated by Pfam; these families can be defined by generating SSNs for these sequences using Option D of EFI-EST. The GNN in Figure D was generated with a minimum co-occurrence frequency of 0.30. At lower co-occurrence frequencies (Figure ), members of four families of solute binding proteins [SBPs; Peripla_BP_6 (PF13458), SBP_bac_3 (PF00497), Peripl_BP_8 (PF13416), and SBP_bac_5 (PF00496)] for ABC transport systems also are genome proximal to the SSN queries with co-occurrence frequencies of 0.16, 0.11, 0.07, and 0.03, respectively, and median distances of 6.0, 5.0, 2.0, and 6.0 orfs, respectively. Also, members of the major facilitator superfamily (MFS_1, PF07690) and an amino acid permease family (AA_permease_2 family, PF13520) are genome proximal to the SSN queries with co-occurrence frequencies of 0.15 and 0.11, respectively, and median distances of 9.0 and 2.0 orfs, respectively. 
The enzymes in metabolic pathways usually are conserved (orthologues instead of analogues; vide infra), but transport systems and transcriptional regulators often are not conserved, so members of multiple families of transporters and regulators may be genome proximal to the queries in the SSN cluster.
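The two statistics quoted for each Pfam spoke-node can be illustrated with a small sketch. It assumes that the co-occurrence frequency is the fraction of queries in the SSN cluster that have at least one neighbor in the family, and that the median distance is taken over the absolute orf offsets of those neighbors. The function name `summarize_neighbors`, the input layout, and the toy neighborhoods are hypothetical and are not taken from EFI-GNT.

```python
from collections import defaultdict
from statistics import median


def summarize_neighbors(neighborhoods, min_cooccurrence=0.30):
    """Summarize neighbor Pfam families for one SSN cluster.

    neighborhoods: dict mapping query ID -> list of (pfam_family, offset_in_orfs).
    Returns {pfam_family: (co_occurrence_frequency, median_distance)} for the
    families whose frequency is at least min_cooccurrence.
    """
    n_queries = len(neighborhoods)
    if n_queries == 0:
        return {}
    queries_with_family = defaultdict(set)
    distances = defaultdict(list)
    for query, neighbors in neighborhoods.items():
        for family, offset in neighbors:
            queries_with_family[family].add(query)
            distances[family].append(abs(offset))  # upstream and downstream count alike
    summary = {}
    for family, queries in queries_with_family.items():
        frequency = len(queries) / n_queries
        if frequency >= min_cooccurrence:
            summary[family] = (round(frequency, 2), median(distances[family]))
    return summary


# Hypothetical toy neighborhoods for four queries (not real EFI-GNT output).
toy = {
    "Q1": [("PF01266", 1), ("PF00701", 2), ("PF00171", -2)],
    "Q2": [("PF01266", -1), ("PF00701", 2)],
    "Q3": [("PF01266", 1), ("PF00005", 4)],
    "Q4": [("PF00701", 3), ("PF00171", 2)],
}

print(summarize_neighbors(toy))                         # default cutoff of 0.30
print(summarize_neighbors(toy, min_cooccurrence=0.20))  # lower cutoff admits PF00005
```

Lowering `min_cooccurrence` in the sketch mirrors the way additional transporter families appear in the GNN when the minimum co-occurrence frequency is reduced.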

Figure 7

GNN for SSN cluster 16 presented at different query-neighbor co-occurrence frequencies. (A) 3%. (B) 5%. (C) 10%. (D) 12%. (E) 15%. (F) 20%.

This figure illustrates the ability of GNNs to analyze genome neighborhoods as a function of co-occurrence frequency, thereby allowing the identification of pathways that may be encoded by single genome neighborhoods in some species and multiple genome neighborhoods in other species. An example of the utility of this capability is described in the next section.[34]

Use of Transport System SBPs To Anchor Pathway Prediction Using SSNs and GNNs

For uncharacterized pathways, pathway prediction is facilitated by independent information about the substrate for the first enzyme in the pathway. For microbial enzymes in catabolic pathways, such information can be obtained from the identity of the solute for the transporter (or the ligand for a transcriptional regulator). For ABC, TRAP, and TCT transport systems, the solute is conveyed to the membrane components with a soluble extracellular (Gram-positive)/periplasmic (Gram-negative) solute binding protein (SBP); SBPs can be purified on a large scale and subjected to ligand screening with differential scanning fluorimetry (DSF)/ThermoFluor using a physical library of small molecules.[98] These ligand specificities anchor the pathway by identifying the substrate for the first enzyme; the Pfam families of the neighbors allow the reactions to be predicted. Experiments, both in vitro and in vivo, are required to validate the pathway. Using this strategy (experimentally determined ligands for SBPs combined with the synergistic use of SSNs and GNNs to identify pathway components), the EFI identified several novel catabolic pathways. A particularly informative example is the discovery of catabolic pathways for the three tetritols, d-threitol, l-threitol, and erythritol, in Mycobacterium smegmatis.[34] Ligand screening identified one SBP for an ABC transporter that bound d-threitol; a genome-proximal dehydrogenase catalyzed its oxidation; however, other catabolic enzymes were encoded elsewhere in the genome (Figure A). These “missing” enzymes were discovered by first constructing the SSN for the d-threitol dehydrogenase and then the GNN for the cluster containing the dehydrogenase; this identified a d-erythrulose kinase that was encoded by a gene cluster distal to the one containing the SBP and d-threitol dehydrogenase in M. smegmatis (but not other species that encode the pathway). 
The SSN for the kinase family was then constructed, and the cluster containing the d-erythrulose kinase was used to construct the GNN; this identified a second gene cluster distal to both the one containing the SBP and d-threitol dehydrogenase and the one containing the d-erythrulose kinase that contained isomerases to complete the d-threitol pathway. Investigation of other genes in both distal clusters allowed identification of the remaining enzymes in the pathway for d-threitol catabolism as well as the enzymes in the pathways for l-threitol and erythritol catabolism (Figure B). The ligand specificity of a single SBP was sufficient to identify enzymes for three catabolic pathways encoded by three distal gene clusters.
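The iterative logic of this example can be sketched as a repeated expansion from an anchor family. The sketch below is a simplification under stated assumptions: every gene cluster is a set of family labels, any co-occurrence in any genome counts as a link (no co-occurrence frequency cutoff and no SSN step), and the genomes, cluster contents, and the function name `expand_pathway` are hypothetical toy constructs rather than the data or method of ref 34.

```python
def expand_pathway(genomes, seed_family):
    """Iteratively collect families linked to seed_family by genome proximity.

    genomes: dict mapping genome name -> list of gene clusters (sets of family labels).
    Returns a list of rounds; each round lists the families newly reached by
    looking at the neighbors of everything found so far.
    """
    known = {seed_family}
    rounds = []
    while True:
        found = set()
        for clusters in genomes.values():
            for cluster in clusters:
                if cluster & known:          # cluster contains a known family
                    found |= cluster - known  # its other members are candidates
        if not found:
            return rounds
        rounds.append(sorted(found))
        known |= found


# Hypothetical toy layout: in one genome the pathway is split over three distal
# clusters; other genomes keep pairs of the families together.
toy_genomes = {
    "genome_1": [{"SBP", "dehydrogenase"}, {"kinase"}, {"isomerase"}],
    "genome_2": [{"dehydrogenase", "kinase"}],
    "genome_3": [{"kinase", "isomerase"}, {"unrelated_1", "unrelated_2"}],
}

for step, families in enumerate(expand_pathway(toy_genomes, "SBP"), start=1):
    print(step, families)
```

In the toy data the kinase and isomerase are reached only because other genomes encode them next to families that are already known, which is the role played by the large-scale GNN in the tetritol example.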

Figure 8

(A) Strategy for discovering catabolic pathways for d-threitol, l-threitol, and erythritol in M. smegmatis using differential scanning fluorimetry (DSF) to screen the ligand specificities of SBPs and the integrated use of SSNs and GNNs to discover the pathway enzymes. (B) Catabolic pathways for d-threitol, l-threitol, and erythritol. (C) Catabolic pathways for d-threonate, l-threonate, and d-erythronate in R. eutropha H16.[59] Figures in Panel A and B reproduced with permission from ref (34); figure in Panel C reproduced with permission from ref (59).

The EFI also used this strategy to assign functions to members of Domain of Unknown Function 1537 (DUF 1537; approximately 20% of the 16 712 Pfam families in Release 31.0 are families of DUFs or proteins of unknown function).[59] Using the specificities for four SBPs for TRAP transport systems for four-carbon acid sugars, including d-erythronate and l-erythronate, SSNs and GNNs were used to identify two genome neighborhoods in Ralstonia eutropha H16 that encode enzymes in catabolic pathways for d-threonate, l-threonate, and d-erythronate (Figure C). Members of the DUF1537 family (Pfam families PF07005 and PF17402) were determined to be kinases for four-carbon acid sugars, identifying a previously uncharacterized family of kinases. In addition, members of the PdxA2 family (PF04166) were determined to be oxidative decarboxylases that generate dihydroxyacetone phosphate (DHAP) and CO2. In unpublished work, the specificities of three ABC SBPs for d-apiose, a branched chain pentose found in plant cell walls, and the iterative use of SSNs and GNNs have been used to discover five catabolic pathways for d-apiose, two of which are found in species in the human gut microbiome (humans ingest plant cell walls; species of Bacteroides can degrade the rhamnogalacturonan-II component that contains d-apiose to release d-apiose that can be catabolized[99]). 
Two pathways include novel RuBisCO-like proteins (RLPs) from the RuBisCO superfamily: one catalyzes a β-ketoacid decarboxylation, and the second catalyzes a “transcarboxylation” in which the substrate is decarboxylated (β-ketoacid decarboxylation), with the sequestered CO2 used to carboxylate the enediolate intermediate on the adjacent carbon, and the resulting isomeric β-ketoacid undergoes hydrolysis as in the canonical RuBisCO reaction. The experimentally determined specificity of three SBPs anchored discovery of five pathways by identifying the substrates; the iterative use of SSNs and GNNs identified the enzymes.

Comments

The success of the integrated application of SSNs and GNNs to discover metabolic pathways is limited by the proximities of the genes encoding the pathway components, so this analysis may not be successful for all functional assignment problems. However, the large-scale nature of the analyses provides the potential to determine whether colocalization of genes is due to limited genetic drift among similar genomes or pathway conservation among phylogenetically diverse genomes; it also allows identification of low co-occurrence frequency but significant clustering of the genes encoding multiple pathway components that would be tedious to discover by examination of large numbers of individual genome neighborhoods.[34] Also, SSNs provide the ability to segregate members of mechanistically diverse superfamilies and functionally diverse suprafamilies into isofunctional clusters (families). For enzymes, an important test of isofunctionality is that the GNN generated for an SSN cluster identifies the components of a single pathway. The iterative use of SSNs and GNNs provides not only a test of isofunctionality but also a method for determining the minimum SSN alignment score required to achieve isofunctionality. If the GNN for an SSN cluster identifies “too many” components for a single pathway, further segregation of the cluster with a larger alignment score into “daughter” clusters may allow the resolution of the pathways. The reader should recognize that achieving isofunctional clusters in an SSN may not be straightforward, e.g., even within the same superfamily different alignment scores may be required to achieve isofunctional clusters. However, the integration of SSNs and GNNs using EFI-EST and EFI-GNT provides a powerful strategy for assessing and achieving isofunctional clusters.
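The effect of raising the alignment score on cluster segregation can be sketched with a few lines of code. The sketch assumes that an SSN edge is drawn between two sequences when their pairwise alignment score meets the threshold and that clusters are the connected components of the resulting graph; the function name `ssn_clusters`, the node labels, and the scores are hypothetical.

```python
def ssn_clusters(nodes, scored_pairs, min_score):
    """Return SSN clusters (connected components) at a given alignment score.

    nodes: iterable of sequence IDs.
    scored_pairs: iterable of (id_a, id_b, alignment_score).
    An edge is kept only when alignment_score >= min_score.
    """
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, score in scored_pairs:
        if score >= min_score:
            parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


# Hypothetical toy family: two tight groups joined by one weak edge (C-D).
toy_nodes = ["A", "B", "C", "D", "E"]
toy_pairs = [("A", "B", 80), ("B", "C", 75), ("C", "D", 40), ("D", "E", 85)]

print(ssn_clusters(toy_nodes, toy_pairs, min_score=35))  # one parent cluster
print(ssn_clusters(toy_nodes, toy_pairs, min_score=60))  # two daughter clusters
```

In practice the threshold would be raised until the GNN for each daughter cluster points to a single pathway, as described above.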

Chemically Guided Functional Profiling: Building on EFI-EST

With ∼50% of the proteins in the sequence databases having incorrect, uncertain, or unknown functions, devising a target selection strategy is a major challenge for functional assignment. The SSNs for functionally diverse enzyme families often have many uncharacterized clusters; the problem is deciding which are worth experimental characterization. One approach is to select those that are most biologically relevant, but how is that achieved in the absence of knowledge of their functions? Balskus and Huttenhower recently described a strategy for choosing biologically relevant targets termed “chemically guided functional profiling”.[72] This strategy involves (1) construction of the SSN for a targeted protein family segregated into isofunctional families and (2) mapping the abundance of metagenome reads to the clusters in the SSN, with the uncharacterized clusters that have the largest number of metagenome markers given the highest priority for functional characterization (Figure A). ShortBRED[100] provides a fast and accurate method to profile metagenome samples and uses sequence fragments from the clusters in the SSN (“markers”) to identify homologous sequences in the metagenome reads; their abundance is then mapped to the SSN clusters to accomplish target selection.
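The target-selection step can be sketched as a simple aggregation. The sketch assumes that marker abundances have already been quantified (for example, by a profiling tool such as ShortBRED, which is not called here) and that each marker belongs to exactly one SSN cluster. The function name `rank_targets`, the marker IDs, cluster labels, and abundance values are hypothetical and are not data from ref 72.

```python
def rank_targets(marker_to_cluster, marker_abundance, characterized):
    """Rank uncharacterized SSN clusters by summed metagenome marker abundance.

    marker_to_cluster: dict mapping marker ID -> SSN cluster label.
    marker_abundance: dict mapping marker ID -> abundance in the metagenome samples.
    characterized: set of cluster labels with known functions (excluded from ranking).
    Returns a list of (cluster, total_abundance), most abundant first.
    """
    totals = {}
    for marker, cluster in marker_to_cluster.items():
        # Markers that were never detected contribute zero abundance.
        totals[cluster] = totals.get(cluster, 0.0) + marker_abundance.get(marker, 0.0)
    candidates = [(c, t) for c, t in totals.items() if c not in characterized]
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Hypothetical toy input.
toy_markers = {"m1": "cluster_1", "m2": "cluster_15", "m3": "cluster_15",
               "m4": "cluster_16", "m5": "cluster_20"}
toy_abundance = {"m1": 50.0, "m2": 12.5, "m3": 7.5, "m4": 9.0}

print(rank_targets(toy_markers, toy_abundance, characterized={"cluster_1"}))
```

Clusters at the top of the returned list would be the first candidates for experimental characterization.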

Figure 9

(A) Strategy for chemically guided functional profiling. (B) SSN for the glycyl radical enzyme superfamily showing clusters with previously assigned functions as well as clusters (15 and 16) for which chemically guided functional profiling was used to leverage experimental functional assignment. Figures reproduced with permission from ref (72).

The utility of chemically guided functional profiling was demonstrated using the glycyl radical enzyme (GRE) superfamily; the reactions are initiated by abstraction of a hydrogen atom from the substrate by a glycine-centered backbone radical (generated by an activase from the S-adenosyl methionine superfamily). The metagenome samples used for target selection were from the human gut microbiome, so uncharacterized members of the GRE superfamily are likely involved in reactions that allow the microbiome to utilize small molecules in the gut. Balskus previously had identified choline trimethylamine-lyase (CutC) in human gut microbiome species; CutC catalyzes the cleavage of choline to acetaldehyde and trimethylamine, the latter involved in the production of methane as well as implicated in human diseases via its N-oxide.[101,102] The SSN for the GRE family is shown in Figure B. The functionally assigned clusters are colored, as are two clusters (15 and 16) that were identified as abundant in the human gut microbiome. Both of the latter clusters were hypothesized to be dehydratases based on conserved active site residues associated with known dehydratase reactions. Cluster 15 was characterized as a 4-hydroxyproline dehydratase; again, genome context was used to predict the substrate because of its proximity to Δ1-pyrroline-5-carboxylate (P5C) reductase that reduces P5C that would be derived from dehydration of 4-hydroxyproline to proline. 
Cluster 16 was characterized as a novel (S)-1,2-propanediol dehydratase (a previously characterized analogue is an adenosylcobalamin-dependent enzyme); the identity of the substrate was suggested from genome analysis because the enzyme is found in Roseburia inulinivorans that catabolizes l-fucose but lacks the adenosylcobalamin-dependent dehydratase. A “user friendly” web tool is not yet available to allow the community to use “chemically guided functional profiling” with their favorite families. However, the development of such a web tool is a high-priority goal given its ability to identify important targets for functional characterization.

AGeNNT and Refined GNNs: Building on EFI-GNT

EFI-GNT provides GNNs in two formats that summarize (1) the Pfam families identified by each SSN cluster (edges between SSN cluster hub-nodes and Pfam family spoke-nodes), providing information about the reactions in metabolic pathways, and (2) the SSN clusters that identify each Pfam family (edges between Pfam family hub-nodes and SSN cluster spoke-nodes), providing information about whether multiple clusters may contain orthologues. Merkl and co-workers recently described AGeNNT (Automatically Generates refined Neighborhood NeTworks), a Java application that uses the GNNs provided by EFI-GNT to generate a third format (“refined GNN”) in which all of the SSN cluster and Pfam family nodes are connected by edges.[71] Clusters that contain orthologues, identified when they share the same genome neighbors, can be distinguished from clusters that have different genome contexts. An SSN is submitted to the EFI-GNT web tool. AGeNNT then generates the refined GNN. Several options are provided, including (1) eliminating overrepresented phylogenetically related subspecies from the input SSN to reduce redundancy in the GNN and (2) using a user-defined “whitelist” of Pfam families to include in the refined GNN. For example, only Pfam families for enzymes can be included in the refined GNN so Pfam cluster connections between SSN clusters that involve transporters and transcriptional regulators are eliminated (in contrast to pathway enzymes, transporters and transcriptional regulators are not conserved). Continuing again with the proline racemase family (PF05544) to provide an example, several major clusters from the SSN were selected for generation of GNNs using EFI-GNT and the refined GNN using AGeNNT (Figure ). 
The colored SSN is shown in Figure A, the SSN cluster hub-node GNN format is shown in Figure B, the Pfam family hub-node GNN format is shown in Figure C, and the refined GNN is shown in Figure D (Pfam families for transport systems and transcriptional regulators are deleted in the GNNs; because these families are not conserved in pathways (vide supra), their inclusion in the refined GNN can complicate the analysis). Comparison of the refined GNN with the GNNs establishes the utility of the refined GNN in identifying orthologous SSN clusters: clusters 2, 4, 5, and 6 are orthologous 4-hydroxyproline epimerases; clusters 1 and 3 are orthologous trans-3-hydroxyproline dehydratases; and cluster 7 is proline racemase (using functional assignments based on experimental verification[97]). Building on EFI-EST and EFI-GNT, AGeNNT links SSN clusters that share pathway context, potentially identifying interrelations of subfamilies within a protein family.
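The idea behind linking SSN clusters through shared, whitelisted Pfam neighbors can be sketched as follows. This is not AGeNNT's algorithm; it is a minimal illustration that assumes each SSN cluster has a set of neighbor family labels, keeps only whitelisted (enzyme) families, and groups clusters that share at least `min_shared` of them. The function name, cluster labels, and family labels ("E…" for enzymes, "T…" for transporters or regulators) are hypothetical.

```python
from itertools import combinations


def group_by_shared_neighbors(cluster_neighbors, whitelist, min_shared=1):
    """Group SSN clusters that share whitelisted neighbor families.

    cluster_neighbors: dict mapping SSN cluster label -> set of neighbor family labels.
    whitelist: set of family labels allowed to link clusters.
    Returns a sorted list of groups, each a sorted list of cluster labels.
    """
    filtered = {c: fams & whitelist for c, fams in cluster_neighbors.items()}
    groups = {c: {c} for c in filtered}
    for a, b in combinations(filtered, 2):
        shares_enough = len(filtered[a] & filtered[b]) >= min_shared
        if shares_enough and groups[a] is not groups[b]:
            merged = groups[a] | groups[b]
            for member in merged:
                groups[member] = merged  # every member points at the merged group
    unique = {id(g): g for g in groups.values()}
    return sorted(sorted(g) for g in unique.values())


# Hypothetical toy GNN summary.
toy_gnn = {
    "c1": {"E1", "E2", "T1"},
    "c2": {"E1", "E2", "T2"},
    "c3": {"E3", "T1"},
    "c4": {"E3", "E4"},
    "c5": {"E5", "T1"},
}
enzymes_only = {"E1", "E2", "E3", "E4", "E5"}

print(group_by_shared_neighbors(toy_gnn, enzymes_only))                 # three groups
print(group_by_shared_neighbors(toy_gnn, enzymes_only | {"T1", "T2"}))  # everything merges
```

The second call shows why excluding transporters and regulators matters in this toy case: a single shared transporter family is enough to connect clusters that otherwise have different pathway contexts.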

Figure 10

(A) Colored SSN generated by EFI-GNT for selected clusters in the proline racemase family (PF05544). (B) GNN with SSN cluster hub-nodes and Pfam family spoke-nodes. (C) GNN with Pfam family hub-nodes and SSN cluster spoke-nodes. (D) Refined GNN showing identification of three different functions as deduced by connections (or lack thereof) between SSN cluster and Pfam family nodes.

Future Directions

EFI-EST and EFI-GNT provide experimentalists with otherwise inaccessible but essential perspectives on sequence–function space in protein families and genome context that facilitate the assignment of functions to uncharacterized enzymes. Other web tools are available for smaller scale analysis of protein families, but genomic enzymology “requires” large-scale analyses to provide the maximum amount of context. Other large-scale web tools can be imagined. For example, the proteome of an organism (or of a community) determines its metabolic capabilities; therefore, an easy-to-construct overview of the metabolic potential would be useful and could be provided by a “proteome network” (PN) tool. A PN would include a node for each protein encoded by a genome (or community) and collected into Pfam family clusters (Pfam family hub-node and protein spoke nodes). The PN would identify the catalytic capabilities via the identities of the Pfam families and, also, the locations of the proteins (spoke nodes) in the SSNs for their families. For a community PN, identification of species-specific Pfam families could provide the potential to identify syntrophic metabolic pathways, e.g., different organisms contribute different metabolic capabilities to synthesize a natural product or degrade an energy source. In analogy with chemically guided functional profiling, mapping transcriptome abundance to the PN would provide a visually powerful approach for identifying enzymes in novel pathways. Also, the Pfam families that contribute enzymes to a pathway often are conserved in phylogenetically diverse organisms; however, we have observed that one or more reactions in a metabolic pathway can be catalyzed by analogues (nonorthologous gene replacements) in different taxonomic ranks, e.g., phyla, class, order, or family. The ability to discover analogues may be enhanced by clustering members of a protein family by taxonomic rank instead of pairwise sequence identity (SSNs). 
Because the node attributes that are provided by EFI-EST for sequences include taxonomic ranking, a taxonomic rank network (“TRN”) would be easy to construct. Subsequent generation of sequence similarity-based SSNs for individual clusters in the TRN would be accomplished with Option D of EFI-EST, thereby providing the ability to further segregate and analyze the clusters by sequence homology. Finally, although the generation of an SSN is straightforward, Release 31.0 of the Pfam database defines 16 712 families. Immediate access to a library of precomputed SSNs for all Pfam families would provide the biological and biomedical communities, including users of web tools that identify BGCs (vide supra), with the ability to quickly place their favorite enzymes in the context of sequence–function relationships for their protein families. This library of SSNs should be regularly updated to provide current information (perhaps in parallel with releases of the InterPro database), but its construction requires considerable computational resources. We have demonstrated that the calculation of this database is feasible, although we have not yet been able to initiate the production phase of this effort. I encourage readers to (1) try the EFI-EST and EFI-GNT web tools, (2) imagine new applications for SSNs and GNNs, and (3) identify additional large-scale data visualization and analysis challenges that would be amenable to solution by community-accessible web tools. Like the natural products community, the enzymology community needs to recognize the essential role of web tools that allow the protein and genome sequence databases to be leveraged for the solution of biological problems.
