John A. Gerlt. Departments of Biochemistry and Chemistry, Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, Illinois 61801, United States.
Abstract
The exponentially increasing number of protein and nucleic acid sequences provides opportunities to discover novel enzymes, metabolic pathways, and metabolites/natural products, thereby adding to our knowledge of biochemistry and biology. The challenge has evolved from generating sequence information to mining the databases to integrating and leveraging the available information, i.e., the availability of "genomic enzymology" web tools. Web tools that allow identification of biosynthetic gene clusters are widely used by the natural products/synthetic biology community, thereby facilitating the discovery of novel natural products and the enzymes responsible for their biosynthesis. However, many novel enzymes with interesting mechanisms participate in uncharacterized small-molecule metabolic pathways; their discovery and functional characterization also can be accomplished by leveraging information in protein and nucleic acid databases. This Perspective focuses on two genomic enzymology web tools that assist the discovery of novel metabolic pathways: (1) Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) for generating sequence similarity networks to visualize and analyze sequence-function space in protein families and (2) Enzyme Function Initiative-Genome Neighborhood Tool (EFI-GNT) for generating genome neighborhood networks to visualize and analyze the genome context in microbial and fungal genomes. Both tools have been adapted to other applications to facilitate target selection for enzyme discovery and functional characterization. As the natural products community has demonstrated, the enzymology community needs to embrace the essential role of web tools that allow the protein and genome sequence databases to be leveraged for novel insights into enzymological problems.
In 2001 Patricia Babbitt and
I discussed nature’s strategies for divergent evolution of
new enzymatic functions from a common progenitor to yield mechanistically
diverse enzyme superfamilies (conserved active site architectures
that catalyze reactions with shared partial reactions, intermediates,
or transition states) and functionally diverse suprafamilies (conserved
active site architectures that catalyze mechanistically distinct reactions).[1] When our review was published, only a few superfamilies/suprafamilies
had been recognized, including the enolase, amidohydrolase, thiyl
radical, enoyl-CoA hydratase (crotonase), vicinal-oxygen-chelate superfamilies,
and the orotidine 5′-monophosphate (OMP) decarboxylase suprafamily,
not surprising because the UniProt database then contained only 571 804
protein sequences (July 2001) (http://www.uniprot.org/; see Table for a summary of abbreviations). Despite,
in retrospect, a meager number of sequences, we concluded that enzymologists
were positioned to expand their interests beyond studies of single
enzymes to encompass entire enzyme families. We proposed that sequenced
genomes (1) provided a rapidly expanding source of new proteins for
investigation and (2) allowed genomic context to be used to infer
novel enzymatic functions and, therefore, better understand the evolution
of functional diversity in enzyme superfamilies. We suggested the
term genomic enzymology to describe the expansive
strategy of using protein families and genome context to focus studies
of enzyme mechanisms, discover new functions, and more accurately
describe the evolution of enzyme function in molecular terms (sequence
and structure). However, we did not propose how the protein and genome
sequence databases could be leveraged and used by the experimental
community.
Sixteen years later, the UniProt database
contains 88 588 026
nonredundant sequences (Figure ; Release 2017_07); the number of sequences is increasing
at the rate of 2.4% per month (doubling time 2.5 years), largely the
result of microbial genome projects. The challenge is to devise “user
friendly” methods to interrogate the massive amount of data
so that hypotheses can be generated that direct experimental determination
of in vitro activities and in vivo metabolic functions of uncharacterized enzymes. For example, 379
mechanistically diverse superfamilies and functionally diverse suprafamilies
have been described;[2] additional superfamilies
and suprafamilies must be present in (1) genomic “dark matter”
that has not been curated by databases such as Pfam and (2) the genomes
of phylogenetically diverse bacterial species that have not yet been
systematically sequenced.[3] This large,
and growing for the foreseeable future, set of superfamilies includes
members that catalyze novel reactions in novel pathways, a boon to
enzymologists.
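The relationship between the quoted monthly growth rate and the doubling time can be checked with a short calculation. The following is a minimal sketch that assumes simple compound monthly growth; the function name is illustrative and is not part of any UniProt resource.

```python
import math


def doubling_time_months(monthly_growth_rate: float) -> float:
    """Months needed for a quantity to double under compound monthly growth."""
    return math.log(2.0) / math.log(1.0 + monthly_growth_rate)


# 2.4% per month is the growth rate quoted in the text.
months = doubling_time_months(0.024)
years = months / 12.0
print(f"doubling time: {months:.1f} months ({years:.2f} years)")
```

Under this assumption the result is a little over 29 months, consistent with the doubling time of roughly 2.5 years stated above.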
Figure 1
Growth of the UniProt protein sequence database (Release
2017_07).
The blue line represents the EMBL/TrEMBL sequences with automated
annotations; the red line represents the EMBL/SwissProt with manually
curated annotations. Currently, the doubling time is ∼2.5 years.
The number of sequences decreased by ∼50% in April 2015 when
UniProt identified reference proteomes for closely related species
and archived the redundant proteomes.
Approximately 50% of the proteins in the databases have incorrect,
uncertain, or unknown functional annotations.[4] The UniProt Knowledgebase (UniProtKB) is composed of two sections,
UniProtKB/SwissProt and UniProtKB/TrEMBL. The annotations in UniProtKB/SwissProt
are manually curated; the functional annotations in UniProtKB/TrEMBL
are computationally assigned based on the function of the “closest”
homologue. In the most recent UniProt release (2017_07), only 0.63%
of the sequences are in the UniProtKB/SwissProt section (Figure ); this fraction
continues to decrease because the total number of sequences added
in each release greatly exceeds the number of new sequences with SwissProt-curated,
experimentally verified annotations. In principle, curated annotations
might be extended to orthologues; however, the sequence boundaries
between functions are unknown, so homology-based approaches for functional
assignment are risky. Therefore, incorrect, uncertain, or unknown
annotations will continue to propagate, compromising their utility
to allow the discovery of new enzymatic functions, metabolic pathways,
metabolites, and biology.

Khosla recently summarized this challenge:[5] “Although enzymology will remain a predominantly
experimental
science for the foreseeable future, one cannot avoid a sense of helplessness
when one considers the huge (and growing) deficit in functionally
annotated sequences. By now, there are approximately 100 million nonredundant
protein sequence entries in GenBank, but a reliably curated protein
database such as SwissProt contains fewer than 1 million entries.
This is a quintessential ‘big data’ problem, where the
rate at which data is generated continues to outpace the rate at which
it is curated. It is unlikely that more resource-intensive curation
alone can solve the problem. As the proverb says, this may be a situation
where the most desirable approach will involve user-friendly tools
that teach a novice how to fish instead of serving fish. Such tools
could ideally capture the essence of an enzymologist’s judgment
in layers of increasing sophistication, depending on the user’s
actual needs.”

This Perspective describes “genomic
enzymology” web
tools that initially were developed by the Enzyme Function Initiative
(EFI)[6] and provides examples of their applications.
Web Tools for Natural Product Discovery
In parallel
with the development of genomic enzymology, the natural products community
discovered that genes encoding biosynthetic pathways for natural products
often are organized in “biosynthetic gene clusters”
(BGCs).[7−9] Given the structural complexity of natural products
and the need to identify the enzymes that assemble their backbones,
e.g., terpene synthases, nonribosomal peptide synthases (NRPSs), and
polyketide synthases (PKSs), as well as the enzymes that catalyze
“tailoring” reactions, e.g., glycosylases, methylases,
and redox enzymes, the genomic colocalization of the biosynthetic
genes facilitates pathway discovery and experimental characterization.
Although the type of scaffold may be apparent from the annotations
in the BGCs, the structure of the natural product is not trivial to
predict. Indeed, many enzymes (backbone-forming and tailoring) are
novel members of diverse enzyme superfamilies. Nonetheless, the discovery
of a BGC facilitates enzyme identification so that they can be experimentally
tested for sequential activities in the biosynthetic pathway.

The number of natural products is estimated to be extremely large;[10,11] therefore, identification of BGCs is an attractive strategy for
their discovery. In the past several years, bioinformatic tools have
been developed for discovering BGCs in sequenced genomes,[12,13] including antiSMASH (Antibiotics & Secondary Metabolite Analysis
SHell[14]), PRISM (PRediction Informatics
for Secondary Metabolomes[15]), and RODEO
(Rapid ORF Description and Evaluation Online[16]). These tools are widely used by the natural products/synthetic
biology community, e.g., more than 300 000 jobs have been processed
by the antiSMASH server (https://antismash.secondarymetabolites.org/). Although these tools enable the discovery of BGCs, the annotations
of the uncharacterized enzymes in the BGCs are limited to their membership
in protein families, an overview that often is insufficient to restrict
substrate specificities and/or reaction identities/mechanisms. Therefore,
many of the challenges in BGC characterization are the same as those
encountered by enzymologists focused on small-molecule metabolic pathways
(vide infra).
What Should Genomic Enzymology Tools Provide?
Genomic
enzymology focuses on the discovery of function in the context of
entire enzyme families: this approach allows recognition of sequence
and structure attributes that are conserved for specific functions.
Babbitt developed the Structure–Function Linkage Database (SFLD; http://sfld.rbvi.ucsf.edu/) to generate and disseminate sequence–structure relationships
that associate specific functional properties with specific sequence
and structure motifs in functionally diverse enzyme superfamilies.[17] As an early example of the use of genomic enzymology
to obtain mechanistic insights, the recognition that (1) the reactions
catalyzed by mandelate racemase and muconate lactonizing enzyme in
the enolase superfamily require stabilization of an enolate anion
intermediate and (2) their sequences have conserved motifs for binding
an active site Mg2+ defined the catalytic strategy for
the superfamily.[1,18,19] The functional diversity in the superfamily, including dehydration,
deamination, cycloisomerization, racemization, and epimerization of
carboxylate-anion substrates, could be explained by divergent evolution
selecting (1) acid/base catalysts for both generating the enolate
anion intermediate and directing it to products and (2) specificity
determinants for binding different substrates in productive geometries
relative to the acid/base catalysts.[20,21] This same
strategy for evolution of new enzymatic functions applies to many
mechanistically diverse superfamilies.[2]

The challenges for genomic enzymology are developing and applying large-scale methods for (1) grouping members of mechanistically
diverse superfamilies and functionally diverse suprafamilies in isofunctional
families, e.g., identifying acid/base catalysts and placing restrictions
on reaction mechanisms and substrate specificities and (2) analyzing
the genome contexts for the members of isofunctional families so that
their roles in metabolic pathways can be deduced, e.g., predicting
substrates, intermediates, and products.
Sequence Similarity Networks (SSNs)
Evolutionary biologists
typically use phylogenetics-based approaches to distinguish orthologues
from paralogues.[22,23] Phylogenetic trees are constructed
from multiple sequence alignments (MSAs); however, MSAs are difficult
to generate for large protein families.[23] Many superfamilies and suprafamilies are large: >15 K sequences
in the glycyl-radical enzyme superfamily, >22 K sequences in the
OMP
decarboxylase suprafamily, >44 K sequences in the enolase superfamily,
>122 K sequences in the enoyl-CoA hydratase (crotonase) superfamily,
and >250 K sequences in the radical SAM superfamily. In addition
to
being difficult to construct, trees for large families also are difficult
to interpret because of their complexity.[24] Trees do not provide immediate access to all sequences in a family—representative
sequences usually are selected in the construction of the tree. Instead,
what is needed is a large-scale approach that allows easy visualization
and analyses for all sequences in a family, recognizing that it must
be “user friendly”, i.e., intuitive and fast.

Atkinson and Babbitt introduced sequence similarity networks (SSNs)
to enable large-scale analyses of sequence–function relationships
in protein families.[25] An SSN displays
pairwise relationships obtained from an all-by-all sequence comparison,
e.g., BLAST. Although the use of BLAST can be criticized because it
provides a measure of overall sequence similarity and, therefore,
may be insensitive to different domain architectures important in
determining molecular function, it is (1) fast, a requirement for
routine all-by-all comparisons of the sequences of members of increasingly
large protein families (each sequence must be compared with every
other sequence so the time required increases with the square of the
number of sequences), and (2) familiar to experimentalists. An SSN
contains “nodes” for sequences; “edges”
that quantitate sequence similarity (pairwise sequence identity) connect
nodes that share sequence similarity that exceeds a user-specified
level (Figure ). As
the sequence similarity required to connect nodes with edges is increased,
the nodes segregate into clusters; the goal is to select a level of
sequence similarity that segregates the nodes/members of the family
into isofunctional clusters (Figure ).
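The way clusters emerge as the edge threshold is raised can be illustrated with a small sketch. It is not the EFI-EST or Pythoscape implementation: the function names, the toy sequence IDs, and the toy scores are all hypothetical, and the scores stand in for whatever pairwise measure (e.g., a BLAST-derived alignment score) is used. Clusters are taken to be the connected components of the thresholded network.

```python
from collections import defaultdict


def all_by_all_pairs(n_sequences):
    """Number of unordered sequence pairs in an all-by-all comparison."""
    return n_sequences * (n_sequences - 1) // 2


def ssn_clusters(pairwise_scores, threshold):
    """Return clusters (connected components) of a thresholded similarity network.

    pairwise_scores maps (seq_a, seq_b) to a similarity score; an edge is drawn
    only when the score is at least `threshold`.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for (seq_a, seq_b), score in pairwise_scores.items():
        root_a, root_b = find(seq_a), find(seq_b)  # registers singletons too
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(list)
    for node in list(parent):
        clusters[find(node)].append(node)
    return sorted((sorted(m) for m in clusters.values()), key=lambda c: (-len(c), c))


# Hypothetical toy scores for four sequences (not real data).
toy_scores = {
    ("P1", "P2"): 120, ("P1", "P3"): 60, ("P2", "P3"): 55,
    ("P3", "P4"): 115, ("P1", "P4"): 18, ("P2", "P4"): 16,
}
permissive = ssn_clusters(toy_scores, threshold=50)
stringent = ssn_clusters(toy_scores, threshold=110)
print(permissive)
print(stringent)
print(all_by_all_pairs(250_000))
```

With the toy scores, the permissive threshold leaves a single cluster and the stringent threshold splits it in two. The pair count grows with the square of the number of sequences, which is why a fast comparison method matters for large families.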
Figure 2
A sequence similarity network (SSN) showing the protein
sequence
nodes and pairwise sequence similarity edges.
Figure 3
SSNs for sequences from the proline racemase family (Pfam family
PF05544). (A) Alignment score ≥15, ≥22% pairwise sequence
identity. (B) Alignment score ≥20, ≥25% pairwise sequence
identity. (C) Alignment score ≥50, ≥35% sequence identity.
(D) Alignment score ≥70, ≥40% sequence identity. (E)
Alignment score ≥90, ≥48% sequence identity. (F) Alignment
score ≥110, ≥58% sequence identity. The colors in panel
F are used to color the nodes in panels A–E.
SSNs contain “node attributes”, including
functional
and phylogenetic information associated with each sequence/node, that
assist the user in analyzing sequence–function relationships,
including choosing sequence similarity thresholds for drawing edges
and segregating the families into isofunctional clusters. Atkinson
and Babbitt compared SSNs with phylogenetic trees and concluded “the
most valuable feature of SSNs is not the optimal or most accurate
display of sequence similarity, but rather the flexible visualization
of many alternate protein attributes for all or nearly all sequences
in a superfamily”.[25]

SSNs
are viewed using Cytoscape (http://cytoscape.org/), “an open source platform for
visualizing complex networks and integrating these with attribute
data”.[26] Although Cytoscape has
a steep “learning curve”, it provides Control Panels
to select nodes based on the node attributes and to filter and color
the networks to enable visual analyses. With node attributes and the
Control Panels, SSNs viewed with Cytoscape satisfy Khosla’s
vision that genomic enzymology tools “could ideally capture
the essence of an enzymologist’s judgment in layers of increasing
sophistication, depending on the user’s actual needs”.[5]

The SFLD provides SSNs for several functionally
diverse superfamilies
with manually curated (labor intensive and expensive) annotations/node
attributes;[17] these SSNs serve as “gold
standards” for functional annotation in both the bioinformatics
and enzymology communities.[27] However,
with the large number of superfamilies/suprafamilies (vide
infra) and families that provide additional metabolic enzymes,
e.g., dehydrogenases, kinases, and aldolases, community-initiated
generation of SSNs is necessary. The SFLD does not provide this capability;
Pythoscape was developed by the SFLD for generating large SSNs, but
it is not “user friendly” for most experimentalists
because it requires access to a computer cluster and programming expertise.[28]

In principle, the construction of SSNs
is “simple”,
i.e., connecting sequences with edges that quantitate similarity.
However, most experimentalists would be hard-pressed to develop their
own programs for generating SSNs. And, other web tools that construct
SSNs, e.g., Pclust[29] and CLANS,[30] use a limited number of sequences and/or node
attributes.

The EFI developed a web tool, the Enzyme Function
Initiative-Enzyme
Similarity Tool (EFI-EST; http://efi.igb.illinois.edu/efi-est/),[31] to generate SSNs for large protein
families. To date, >1600 unique users have submitted jobs to EFI-EST,
and >50 publications have appeared that reference the use of EFI-EST.[13,14,32−78] EFI-EST uses sequences and node attribute information from UniProt:
in contrast to the NCBI database, annotations in the UniProt database
can be changed with data provided by any member of the community,
allowing important corrections and additions that diminish propagation
of annotation errors.

EFI-EST now provides four options for
selecting sequences to be
included in the SSN: Option A, a single user-supplied sequence is
used to collect homologues with BLAST from the UniProt database (maximum
10 000 sequences); Option B, the user specifies one or more
UniProt and/or InterPro families [currently limited to ≤255,000
sequences to allow the SSN for the radical SAM superfamily (Pfam family
PF04055) to be generated]; Option C (enhanced in the most recent update),
the user provides a FASTA file of sequences and selects whether accession
IDs in the headers are used to retrieve node attributes from UniProt;
and Option D (new in the most recent update), the user provides a
list of UniProt and/or NCBI accession IDs. After the all-by-all comparison
using BLAST, the user selects an “alignment score” based
on pairwise percent identity to filter the edges (the threshold for
drawing edges to connect nodes). The user then downloads the SSN for
analysis with Cytoscape.

EFI-EST now provides a “Color
SSN Utility” to facilitate
analyses of SSNs by (1) coloring each cluster in an input SSN with
a unique color, (2) providing a file with color information that allows
the user to color SSNs of the same sequences generated with lower
similarity (pairwise identity) to track segregation of clusters (e.g., Figure ), and (3) providing FASTA
files for the sequences in each cluster to facilitate the generation
of MSAs.
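The idea of carrying cluster colors from a stringent SSN back to a more permissive one can be sketched as follows. This is an illustration only: it does not reproduce the Color SSN Utility or its file formats, and the function names, palette, and cluster lists are hypothetical.

```python
def assign_cluster_colors(stringent_clusters, palette):
    """Map each node to the color of its cluster in the stringent SSN.

    Colors are reused if there are more clusters than palette entries, so a
    real palette would need to be at least as large as the cluster count.
    """
    node_color = {}
    for index, members in enumerate(stringent_clusters):
        color = palette[index % len(palette)]
        for node in members:
            node_color[node] = color
    return node_color


def colors_per_cluster(clusters, node_color):
    """List the distinct colors found in each cluster of another SSN."""
    return [sorted({node_color[node] for node in members}) for members in clusters]


# Hypothetical clusters for the same four sequences at two thresholds.
stringent_clusters = [["P1", "P2"], ["P3", "P4"]]
permissive_clusters = [["P1", "P2", "P3", "P4"]]

node_color = assign_cluster_colors(stringent_clusters, ["red", "blue"])
merged_colors = colors_per_cluster(permissive_clusters, node_color)
print(merged_colors)
```

A permissive cluster that contains several colors shows which stringent clusters it segregates into as the threshold is raised.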
Applications of SSNs
The EFI used SSNs from the SFLD
to characterize sequence–function space in targeted functionally
diverse superfamilies (amidohydrolase,[79−85] enolase,[19,86−92] glutathione S-transferase,[93] haloalkanoate dehalogenase,[94] and isoprenoid
synthase[95,96]) and select targets for functional discovery.
Then, when EFI-EST became available, both the EFI and community began
to use SSNs to characterize sequence–function space in a wide
range of protein families.

SSNs generated by the community
using EFI-EST[13,14,32−78] have been used to identify and describe potential isofunctional
families within enzyme families, e.g., clusters with different (but
unknown) substrate specificities, thereby providing an overview of
sequence–function space in specificity diverse superfamilies
(different substrates but same type of overall reaction) and functionally
diverse superfamilies (different substrates and different reaction
mechanisms, although a partial reaction may be conserved). SSNs also
provide the ability to survey the members of a protein family for
different domain architectures that may suggest different functional
contexts, i.e., fusion proteins in different pathways. And, the pathway
for cluster segregation as sequence similarity increases (Figure ) may suggest functional
linkages between clusters. Several community-generated SSNs from the
recent literature that illustrate their use are shown in Figure ; readers are referred
to the publications for detailed descriptions.[13,14,32−78]
Figure 4
Examples
of SSNs generated with EFI-EST that were included in recent
publications. (A) SSN for isopeptidases involved in lasso peptide
synthesis.[43] (B) SSN of precursor peptides
for microviridin synthesis.[60] (C) SSN of
LanMs in lantibiotic synthesis.[76] (D) SSN
for ferredoxins compared with a phylogenetic tree.[40] (E) SSN for IspH in isoprenoid biosynthesis.[56] (F) SSNs for members of the DRE-TIM metallolyase
superfamily.[52] Figures reproduced with
permission from refs (40), (43), (52), (56), (60), and (76).
Genome Neighborhood Networks (GNNs)
With the potential
to segregate protein families into isofunctional clusters using SSNs,
the second genomic enzymology challenge is to place these clusters
in a functional context, e.g., identify the small-molecule metabolic
pathways in which uncharacterized enzymes participate. In eubacteria,
archaea, and fungi, the enzymes in a metabolic pathway often are encoded
by a gene cluster or operon (just as the biosynthetic pathways for
natural products are encoded by BGCs). Therefore, the proteins encoded
by the genes proximal to those that encode members of an isofunctional
cluster (orthologues) may allow the number and types of reactions
in the metabolic pathway to be determined if these are conserved by
the members of the cluster.

Genome neighborhoods for homologues
can be examined using web resources such as JGI-IMG/M (https://img.jgi.doe.gov/cgi-bin/m/main.cgi); however, complete pathways are not always encoded by a single
genome neighborhood. Large-scale mining of genome neighborhoods for
all orthologues in an SSN cluster has the advantage that operon/gene
cluster organization may not be preserved across phylogenetic species;
i.e., the sequences in an isofunctional SSN cluster may have diverse
genome neighborhoods and pathway neighbors, but the ability to survey
all of the neighborhoods provides the potential to identify all of
the functionally linked genes/enzymes that can be assembled into a
metabolic pathway.

In 2014, the EFI described a genome neighborhood
analysis that
was applied to the proline racemase family (Pfam family PF05544) using
an all-by-all comparison (with BLAST) of the neighbors to generate
a network (the genome neighborhood network, GNN);[97] the neighbors were segregated into protein families using
an e-value >20 for the edges in the SSN. By assigning unique colors
to the clusters in the SSN (Figure A) and coloring the neighbors in the GNN with the same
color, the neighbors for the sequences in each cluster were identified
(Figure B). Then,
candidates for functionally linked enzymes were recognized and potential
pathways were predicted. This analysis allowed in vitro enzymatic activities and in vivo metabolic functions
(the three pathways shown in Figure C) to be assigned to 85% of the sequences in the family
[2333 sequences in InterPro Release 43.0 (July 2013)].
Figure 5
(A) A colored SSN for
the proline racemase family (PF05544; InterPro
Release 43.0). (B) The GNN generated by an all-by-all BLAST of the
genome neighbors. (C) Three pathways catalyzed by members of the proline
racemase family. The nodes in the GNN (panel B) are colored using
the color clusters in the SSN (Panel A). Figures reproduced with permission
from ref (97).
The EFI subsequently developed
the Enzyme Function Initiative-Genome
Neighborhood Tool (EFI-GNT; http://efi.igb.illinois.edu/efi-gnt/) to provide a “user friendly” interface for generating
GNNs to facilitate the identification of pathway/metabolic context
for isofunctional clusters in SSNs. Although EFI-GNT has not yet been
“officially” announced with a detailed publication (a
manuscript describing the updated version of EFI-EST and EFI-GNT is
in preparation for publication later this year), >250 unique users
have accessed the web tool that is available for community use.

An SSN generated by EFI-EST is the input for EFI-GNT [Figure A; 6419 sequences
in the proline racemase family in InterPro Release 63.0 (May 2017)].
EFI-GNT assigns a unique color (from a palette of 1513 colors) to
each cluster (Figure B). It then interrogates the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) database
for the neighbors of each sequence in each cluster in the input SSN
(for eubacteria, archaea, and fungi), and the neighbors are associated
with their Pfam families. The co-occurrence frequencies of the queries
in the SSN cluster with the neighbors as well as the absolute values
of the distances in open reading frames (orfs) between the queries
and neighbors are calculated. Functionally linked genes encoding a
pathway are expected to have (1) large query-neighbor co-occurrence
frequencies (diminished if operon/gene cluster organization is phylogenetically
diverse) and (2) short distances between the queries and neighbors.
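The two quantities described above can be illustrated with a small sketch for a single SSN cluster. It is not the EFI-GNT code: the input layout, the function name, and the toy neighborhoods are assumptions, and the exact way EFI-GNT aggregates distances (here, a median over all neighbor genes inside the window) is also an assumption.

```python
from collections import defaultdict
from statistics import median


def neighborhood_statistics(neighborhoods, window=10):
    """Summarize neighbor Pfam families for the queries in one SSN cluster.

    neighborhoods maps each query ID to a list of (pfam_family, orf_offset)
    pairs, where orf_offset is the signed position relative to the query.
    Returns, per family, the fraction of queries with at least one neighbor
    in that family and the median absolute distance in orfs.
    """
    n_queries = len(neighborhoods)
    queries_with_family = defaultdict(set)
    distances = defaultdict(list)
    for query, neighbors in neighborhoods.items():
        for family, offset in neighbors:
            if offset == 0 or abs(offset) > window:
                continue  # skip the query itself and genes outside the window
            queries_with_family[family].add(query)
            distances[family].append(abs(offset))
    return {
        family: {
            "co_occurrence": len(queries) / n_queries,
            "median_distance": median(distances[family]),
        }
        for family, queries in queries_with_family.items()
    }


# Hypothetical neighborhoods for four queries (not real data).
toy_neighborhoods = {
    "Q1": [("PF01266", 1), ("PF00701", -2), ("PF00171", 3)],
    "Q2": [("PF01266", -1), ("PF00701", 2)],
    "Q3": [("PF01266", 1), ("PF00171", 12)],
    "Q4": [("PF00701", 2)],
}
stats = neighborhood_statistics(toy_neighborhoods, window=10)
for family, values in sorted(stats.items()):
    print(family, values)
```

In the toy data, the neighbor at offset 12 falls outside the ±10 orf window and is ignored, so that family is counted for only one of the four queries.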
Figure 6
(A) SSN
for the proline racemase family (PF05544, InterPro Release
63.0) segregated with an alignment score of ≥110 (≥58%
pairwise sequence identity). (B) Colored SSN generated by the EFI-GNT
web tool. (C, D) GNN with SSN cluster hub-nodes and Pfam family spoke-nodes.
(E, F) GNN with Pfam family hub-nodes and SSN cluster spoke-nodes.
The GNNs were generated with a ±10 orf genome neighborhood window
and a query-neighbor co-occurrence threshold of 20%.
EFI-GNT provides GNNs in two formats. In one format
(Figure C,D), a cluster
is present
for each SSN cluster: the hub-node represents the sequences in the
SSN cluster (colored with a unique color so that it can be easily
identified in a colored version of the input SSN that is generated),
and the spoke-nodes represent the neighbor Pfam families; this format
allows the user to identify the pathway enzymes. In the second format,
a cluster is present for each neighbor Pfam family: the hub-node represents
the Pfam family, and the spoke nodes represent the SSN clusters that
identified the neighbors (Figure E,F); this format allows the user to assess whether
the similarity (edge) threshold used to generate the input SSN was
too large (pairwise identity too large) so that orthologues are segregated
in multiple clusters, with these identifying the same Pfam family
neighbors and pathway.

In both GNN formats, the co-occurrence
frequencies of the SSN queries
and neighbors are the values of the edges between the hub- and spoke-nodes:
if the co-occurrence frequency exceeds a user-specified threshold,
the edge and spoke-node are present. From the co-occurrence frequencies,
the user can identify neighbors that “always” occur
with the query (the same conserved operon/gene cluster) as well as
those that are less frequently associated (operon/gene cluster in
some species; dispersed genes in other species).

EFI-GNT also
provides files with the UniProt IDs for the sequences
in each neighbor Pfam family that can be used to identify the neighbors
in the SSNs for their families. This mapping (1) assists the selection
of alignment score thresholds for segregating the neighbor SSNs into
isofunctional clusters/families and (2) provides useful context about
possible functional (substrate specificity and reaction mechanism)
relationships that may be useful in deducing in vitro activities and in vivo metabolic functions.
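The two GNN formats described above are two views of the same co-occurrence table, which the following sketch makes explicit. It is a hypothetical illustration, not the EFI-GNT output format: the function names, cluster labels, family labels, and frequencies are invented, and treating the threshold as inclusive is an assumption.

```python
def cluster_hub_gnn(cooccurrence, threshold=0.20):
    """Format 1: one hub per SSN cluster; spokes are neighbor Pfam families."""
    return {
        cluster: {fam: freq for fam, freq in families.items() if freq >= threshold}
        for cluster, families in cooccurrence.items()
    }


def pfam_hub_gnn(cooccurrence, threshold=0.20):
    """Format 2: one hub per Pfam family; spokes are SSN clusters."""
    hubs = {}
    for cluster, families in cooccurrence.items():
        for fam, freq in families.items():
            if freq >= threshold:  # inclusive comparison is an assumption
                hubs.setdefault(fam, {})[cluster] = freq
    return hubs


# Invented co-occurrence frequencies: cluster -> {family: frequency}.
toy_cooccurrence = {
    "cluster_1": {"fam_oxidase": 0.90, "fam_dehydratase": 0.80, "fam_rare": 0.05},
    "cluster_2": {"fam_oxidase": 0.85, "fam_transporter": 0.40},
}
by_cluster = cluster_hub_gnn(toy_cooccurrence)
by_family = pfam_hub_gnn(toy_cooccurrence)
print(by_cluster)
print(by_family)
```

In the second view, a family hub with spokes to several SSN clusters is the pattern the text associates with an input SSN whose similarity threshold may have been too stringent.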
Integrated Use of SSNs and GNNs To Discover Metabolic Pathways
The synergistic
“power” of the EFI-EST and EFI-GNT
web tools for functional annotation of bacterial and fungal enzymes
is the ability to (1) segregate protein families into isofunctional
clusters in an SSN using EFI-EST (the sequences in a cluster have
the same genome context) and (2) use the SSN as the input for EFI-GNT
to interrogate and visualize genome neighborhood context for the isofunctional
clusters in the GNN. To the best of our knowledge, no other web tools
provide this integrated capability.

The GNN format in which
the hub-node represents the SSN cluster and the spoke-nodes represent
the Pfam families (Figure C,D) can be used to identify the enzymes, transcriptional
regulators, and transporters in a metabolic pathway. For example,
continuing with the proline racemase family (PF05544; SSN in Figure A,B), the enzymes
in a catabolic pathway for the conversion of trans-4-hydroxyproline to α-ketoglutarate (middle pathway in Figure C) can be identified
for cluster 16 in the input SSN (Figure D, 792 sequences with genome neighborhoods
in the ENA files). In addition to 4-hydroxyproline epimerase (the
queries in cluster 16 and the SSN hub-node in the GNN cluster in Figure D), the Pfam family
spoke-nodes of the GNN cluster identify the three remaining enzymes
in the pathway: (1) cis-4-hydroxyproline oxidase,
a member of the d-amino acid oxidase family (“DAO”
in Figure D; PF01266,
co-occurrence frequency, 0.91, median distance 1.0 orfs); (2) cis-4-hydroxyproline imino acid dehydratase/deaminase, a
member of the dihydrodipicolinate synthase family (“DHDPS”;
PF00701, co-occurrence frequency, 0.82, median distance 2.0 orfs);
and (3) α-ketoglutarate semialdehyde dehydrogenase, a member
of the aldehyde dehydrogenase family (“Aldedh”; PF00171,
co-occurrence frequency, 0.66, median distance 2.0 orfs). The curations
provided by Pfam provide essential clues for deducing the identities
of the reactions catalyzed by the various neighboring enzymes (conserved
reaction mechanisms).

The GNN in Figure D also includes (1) the ATP-binding component
of an ABC transport
system (“ABC_trans”, PF00005, co-occurrence frequency,
0.35, median distance 4.0 orfs), (2) an additional membrane component
of the ABC transport system (“BPD_transp_1”, PF00528,
co-occurrence frequency, 0.31, median distance 3.0 orfs), and (3)
a bidomain transcriptional regulator (“GntR-FCD”, PF00392
and PF07729, co-occurrence frequency, 0.67, median distance 3.0 orfs).

The GNN analysis also recognizes genome neighbors that are not
associated with any Pfam family (“none” in Figure D; ∼15% of
the proteins in UniProt are not associated with a Pfam family). These
sequences can contain protein families currently not curated by Pfam;
these families can be defined by generating SSNs for these sequences
using Option D of EFI-EST.

The GNN in Figure D was generated with a minimum co-occurrence
frequency of 0.30. At
lower co-occurrence frequencies (Figure 7), members of four families of solute binding
proteins [SBPs; Peripla_BP_6 (PF13458), SBP_bac_3 (PF00497), Peripl_BP_8
(PF13416), and SBP_bac_5 (PF00496)] for ABC transport systems also
are genome proximal to the SSN queries with co-occurrence frequencies
of 0.16, 0.11, 0.07, and 0.03, respectively, and median distances
of 6.0, 5.0, 2.0, and 6.0 orfs, respectively. Also, members of the
major facilitator superfamily (MFS_1, PF07690) and an amino acid permease
family (AA_permease_2 family, PF13520) are genome proximal to the
SSN queries with co-occurrence frequencies of 0.15 and 0.11, respectively,
and median distances of 9.0 and 2.0 orfs, respectively. The enzymes
in metabolic pathways usually are conserved (orthologues instead of
analogues; vide infra), but transport systems and
transcriptional regulators often are not conserved, so members of
multiple families of transporters and regulators may be genome proximal
to the queries in the SSN cluster.
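The effect of the co-occurrence threshold can be illustrated with the frequencies quoted above for SSN cluster 16. The sketch below is only a lookup over the values reported in the text; the function name `spokes_at` is hypothetical, the "none" node and any families not named in the text are omitted, and an edge is assumed to be kept when its frequency is at or above the threshold.

```python
# Co-occurrence frequencies quoted in the text for SSN cluster 16.
COOCCURRENCE = {
    "DAO (PF01266)": 0.91,
    "DHDPS (PF00701)": 0.82,
    "GntR-FCD (PF00392/PF07729)": 0.67,
    "Aldedh (PF00171)": 0.66,
    "ABC_trans (PF00005)": 0.35,
    "BPD_transp_1 (PF00528)": 0.31,
    "Peripla_BP_6 (PF13458)": 0.16,
    "MFS_1 (PF07690)": 0.15,
    "SBP_bac_3 (PF00497)": 0.11,
    "AA_permease_2 (PF13520)": 0.11,
    "Peripl_BP_8 (PF13416)": 0.07,
    "SBP_bac_5 (PF00496)": 0.03,
}


def spokes_at(threshold, table=COOCCURRENCE):
    """Return the Pfam spoke-nodes kept at a threshold, most frequent first."""
    kept = [name for name, freq in table.items() if freq >= threshold]
    return sorted(kept, key=lambda name: -table[name])


# Compare the spoke-nodes retained at a strict and a permissive threshold.
strict = spokes_at(0.30)
permissive = spokes_at(0.10)
```

With these values, the three pathway enzymes, the regulator, and the two ABC transporter components are retained at 0.30, whereas the solute binding proteins and permeases only appear once the threshold is lowered to 0.16 or below.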
Figure 7
GNN for SSN cluster 16 presented at different
query-neighbor co-occurrence
frequencies. (A) 3%. (B) 5%. (C) 10%. (D) 12%. (E) 15%. (F) 20%.
Figure 7 illustrates
the ability of GNNs to analyze genome neighborhoods as a function
of co-occurrence frequency, thereby allowing the identification of
pathways that may be encoded by single genome neighborhoods in some
species and multiple genome neighborhoods in other species. An example
of the utility of this capability is described in the next section.[34]
Use of Transport System SBPs To Anchor Pathway Prediction Using SSNs and GNNs
For uncharacterized pathways, pathway prediction
is facilitated by independent information about the substrate for
the first enzyme in the pathway. For microbial enzymes in catabolic
pathways, such information can be obtained from the identity of the
solute for the transporter (or the ligand for a transcriptional regulator).
For ABC, TRAP, and TCT transport systems, the solute is conveyed to
the membrane components with a soluble extracellular (Gram-positive)/periplasmic
(Gram-negative) solute binding protein (SBP); SBPs can be purified
on large scale and subjected to ligand screening with differential
scanning fluorimetry (DSF)/ThermoFluor using a physical library of
small molecules.[98] These ligand specificities
anchor the pathway by identifying the substrate for the first enzyme;
the Pfam families of the neighbors allow the reactions to be predicted.
Experiments, both in vitro and in vivo, are required to validate the pathway.

Using this strategy (experimentally determined ligands for SBPs combined with
the synergistic use of SSNs and GNNs to identify pathway components),
the EFI identified several
novel catabolic pathways. A particularly informative example is the
discovery of catabolic pathways for the three tetritols, d-threitol, l-threitol, and erythritol, in Mycobacterium
smegmatis.[34] Ligand screening
identified one SBP for an ABC transporter that bound d-threitol;
a genome-proximal dehydrogenase catalyzed its oxidation; however,
other catabolic enzymes were encoded elsewhere in the genome (Figure 8A). These “missing”
enzymes were discovered by first constructing the SSN for the d-threitol dehydrogenase and then the GNN for the cluster containing
the dehydrogenase—this identified a d-erythrulose
kinase that was encoded by a gene cluster distal to the one containing
the SBP and d-threitol dehydrogenase in M. smegmatis (but not other species that encode the pathway). The SSN for the
kinase family was then constructed, and the cluster containing the d-erythrulose kinase was used to construct the GNN; this identified
a second gene cluster distal to both the one containing the SBP and d-threitol dehydrogenase and the one containing the d-erythrulose kinase that contained isomerases to complete the d-threitol pathway. Investigation of other genes in both distal
clusters allowed identification of the remaining enzymes in the pathway
for d-threitol catabolism as well as the enzymes in the pathways
for l-threitol and erythritol catabolism (Figure 8B). The ligand specificity
of a single SBP was sufficient to identify enzymes for three catabolic
pathways encoded by three distal gene clusters.
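The iterative procedure described above (build the SSN for a known enzyme, build the GNN for its cluster, then repeat for each newly identified neighbor) behaves like a breadth-first walk. The following sketch is a toy model only: `TOY_GNN` is a hypothetical lookup table standing in for a full "SSN then GNN" round for each family, and the family labels are illustrative rather than actual Pfam identifiers.

```python
from collections import deque

# Hypothetical stand-in for one SSN -> GNN round per family: each key lists
# the families found genome-proximal to it in at least some species.
TOY_GNN = {
    "SBP": ["threitol_dehydrogenase"],
    "threitol_dehydrogenase": ["SBP", "erythrulose_kinase"],
    "erythrulose_kinase": ["threitol_dehydrogenase", "isomerase"],
    "isomerase": ["erythrulose_kinase"],
}


def walk_pathway(seed, gnn_lookup, max_rounds=5):
    """Breadth-first walk from a seed family through successive GNN rounds.

    Returns {family: round_in_which_it_was_first_found}; the seed is round 0.
    """
    found = {seed: 0}
    queue = deque([seed])
    while queue:
        family = queue.popleft()
        if found[family] >= max_rounds:
            continue  # do not expand beyond the allowed number of rounds
        for neighbor in gnn_lookup.get(family, []):
            if neighbor not in found:
                found[neighbor] = found[family] + 1
                queue.append(neighbor)
    return found


rounds = walk_pathway("SBP", TOY_GNN)
```

In this toy table, the dehydrogenase is reached in the first round, the kinase in the second, and the isomerase in the third, echoing the order in which the components were reported to be found. In practice each round involves choosing an alignment score threshold and inspecting the GNN, so the walk is guided by the investigator rather than fully automated.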
Figure 8
(A) Strategy for discovering
catabolic pathways for d-threitol, l-threitol, and
erythritol in M. smegmatis using
differential scanning fluorimetry (DSF) to screen the ligand specificities
of SBPs and the integrated use of SSNs and GNNs to discover the pathway
enzymes. (B) Catabolic pathways for d-threitol, l-threitol, and erythritol. (C) Catabolic pathways for d-threonate, l-threonate, and d-erythronate in R. eutropha H16.[59] Panels A and B reproduced
with permission from ref (34); Panel C reproduced with permission from ref (59).
The EFI also used this strategy to assign functions to members
of Domain of Unknown Function 1537 (DUF1537; approximately 20% of
the 16 712 Pfam families in Release 31.0 are families of DUFs
or proteins of unknown function).[59] Using
the specificities for four SBPs for TRAP transport systems for four-carbon
acid sugars, including d-erythronate and l-erythronate,
SSNs and GNNs were used to identify two genome neighborhoods in Ralstonia eutropha H16 that encode enzymes in catabolic
pathways for d-threonate, l-threonate, and d-erythronate (Figure 8C). Members of the DUF1537 family (Pfam families PF07005 and PF17402)
were determined to be kinases for four-carbon acid sugars, identifying
a previously uncharacterized family of kinases. In addition, members
of the PdxA2 family (PF04166) were determined to be oxidative decarboxylases
that generate dihydroxyacetone phosphate (DHAP) and CO2.

In unpublished work, the specificities of three ABC SBPs
for d-apiose, a branched-chain pentose found in plant cell
walls,
and the iterative use of SSNs and GNNs have been used to discover
five catabolic pathways for d-apiose,
two of which are found in species in the human gut microbiome (humans
ingest plant cell walls; species of Bacteroides can degrade the rhamnogalacturonan-II
component that contains d-apiose to release d-apiose
that can be catabolized[99]). Two pathways
include novel RuBisCO-like proteins (RLPs) from the RuBisCO superfamily:
one catalyzes a β-ketoacid decarboxylation, and the second catalyzes
a “transcarboxylation” in which the substrate is decarboxylated
(β-ketoacid decarboxylation), with the sequestered CO2 used to carboxylate the enediolate intermediate on the adjacent
carbon, and the resulting isomeric β-ketoacid undergoes hydrolysis
as in the canonical RuBisCO reaction. The experimentally determined
specificity of three SBPs anchored discovery of five pathways by identifying
the substrates; the iterative use of SSNs and GNNs identified the
enzymes.
Comments
The success of the integrated application
of SSNs and GNNs to discover metabolic pathways is limited by the
proximities of the genes encoding the pathway components, so this
analysis may not be successful for all functional assignment problems.
However, the large-scale nature of the analyses provides the potential
to determine whether colocalization of genes is due to limited genetic
drift among similar genomes or pathway conservation among phylogenetically
diverse genomes; it also allows identification of low co-occurrence
frequency but significant clustering of the genes encoding multiple
pathway components that would be tedious to discover by examination
of large numbers of individual genome neighborhoods.[34]

Also, SSNs provide the ability to segregate members
of mechanistically diverse superfamilies and functionally diverse
suprafamilies into isofunctional clusters (families). For enzymes,
an important test of isofunctionality is that the GNN generated for
an SSN cluster identifies the components of a single pathway. The
iterative use of SSNs and GNNs provides not only a test of isofunctionality
but also a method for determining the minimum SSN alignment score
required to achieve isofunctionality. If the GNN for an SSN cluster
identifies “too many” components for a single pathway,
further segregation of the cluster with a larger alignment score into
“daughter” clusters may allow the resolution of the
pathways. The reader should recognize that achieving isofunctional
clusters in an SSN may not be straightforward, e.g., even within the
same superfamily different alignment scores may be required to achieve
isofunctional clusters. However, the integration of SSNs and GNNs
using EFI-EST and EFI-GNT provides a powerful strategy for assessing
and achieving isofunctional clusters.
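The segregation of a cluster into "daughter" clusters at a larger alignment score can be sketched as a connected-components calculation. This is a minimal illustration under stated assumptions: the node names and pairwise scores are invented, the function name `ssn_clusters` is hypothetical, and an edge is kept when its score is at or above the chosen minimum.

```python
def ssn_clusters(nodes, scored_edges, min_score):
    """Group nodes into clusters, keeping edges with score >= min_score.

    scored_edges: iterable of (node_a, node_b, alignment_score) tuples.
    Returns a list of sets (connected components) in a stable order.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, score in scored_edges:
        if score >= min_score:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Invented example: two tight groups joined by one weak edge (C-D).
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B", 90), ("B", "C", 85), ("C", "D", 40),
         ("D", "E", 88), ("E", "F", 92)]

permissive = ssn_clusters(nodes, edges, min_score=35)
stringent = ssn_clusters(nodes, edges, min_score=60)
```

At the permissive score all six sequences form one cluster; at the more stringent score the weak edge is removed and two daughter clusters result. A GNN would then be generated for each daughter cluster to check whether each one now identifies the components of a single pathway.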
Chemically Guided Functional Profiling: Building on EFI-EST
With ∼50% of the proteins
in the sequence databases having
incorrect, uncertain, or unknown functions, devising a target selection
strategy is a major challenge for functional assignment. The SSNs
for functionally diverse enzyme families often have many uncharacterized
clusters—the problem is deciding which are worth experimental
characterization. One approach is to select those that are most biologically
relevant, but how is that achieved in the absence of knowledge of
their functions?

Balskus and Huttenhower recently described
a strategy for choosing biologically relevant targets termed “chemically
guided functional profiling”.[72] This
strategy involves (1) construction of the SSN for a targeted protein
family segregated into isofunctional families and (2) mapping the
abundance of metagenome reads to the clusters in the SSN, with uncharacterized
clusters having the largest number of metagenome markers being the highest
priority for functional characterization (Figure 9A). ShortBRED[100] provides a fast and accurate method to profile metagenome samples
and uses sequence fragments from the clusters in the SSN (“markers”)
to identify homologous sequences in the metagenome reads; their abundance
is then mapped to the SSN clusters to accomplish target selection.
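The target selection step (mapping marker abundance to SSN clusters and prioritizing uncharacterized clusters) can be sketched as a simple aggregation. This does not reproduce ShortBRED; the marker names, abundance values, and the function name `rank_targets` are hypothetical, and the abundances are assumed to have been computed beforehand by a metagenome profiling tool.

```python
def rank_targets(marker_to_cluster, marker_abundance, characterized):
    """Rank uncharacterized SSN clusters by summed marker abundance.

    marker_to_cluster: {marker_id: ssn_cluster_number}
    marker_abundance:  {marker_id: abundance in the metagenome samples}
    characterized:     set of cluster numbers with assigned functions
    Returns [(cluster, total_abundance), ...], most abundant first.
    """
    totals = {}
    for marker, cluster in marker_to_cluster.items():
        totals[cluster] = totals.get(cluster, 0.0) + marker_abundance.get(marker, 0.0)
    candidates = [(c, a) for c, a in totals.items() if c not in characterized]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


# Hypothetical markers and abundances; cluster 1 already has a function.
marker_to_cluster = {"m1": 1, "m2": 15, "m3": 15, "m4": 16, "m5": 20}
marker_abundance = {"m1": 50.0, "m2": 30.0, "m3": 12.5, "m4": 20.0, "m5": 0.5}

priority = rank_targets(marker_to_cluster, marker_abundance, characterized={1})
```

Here the most abundant cluster overall is excluded because its function is already known, and the remaining clusters are ordered by how strongly their markers are represented in the samples.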
Figure 9
(A) Strategy
for chemically guided functional profiling. (B) SSN
for the glycyl radical enzyme superfamily showing clusters with previously
assigned functions as well as clusters (15 and 16) for which chemically
guided functional profiling was used to leverage experimental functional
assignment. Figures reproduced with permission from ref (72).
The utility of chemically guided functional profiling was
demonstrated
using the glycyl radical enzyme (GRE) superfamily; the reactions are
initiated by abstraction of a hydrogen atom from the substrate by
a glycine-centered backbone radical (generated by an activase from
the S-adenosyl methionine superfamily). The metagenome
samples used for target selection were from the human gut microbiome,
so uncharacterized members of the GRE superfamily are likely involved
in reactions that allow the microbiome to utilize small molecules
in the gut. Balskus previously had identified choline trimethylamine-lyase
(CutC) in human gut microbiome species; CutC catalyzes the cleavage
of choline to acetaldehyde and trimethylamine, the latter involved
in the production of methane as well as implicated in human diseases
via its N-oxide.[101,102]

The SSN for the GRE family is shown in Figure 9B. The functionally assigned clusters are
colored, as are two clusters (15 and 16) that were identified as abundant
in the human gut microbiome. Both of the latter clusters were hypothesized
to be dehydratases based on conserved active site residues associated
with known dehydratase reactions. Cluster 15 was characterized as
a 4-hydroxyproline dehydratase; again, genome context was used to
predict the substrate because of its proximity to Δ1-pyrroline-5-carboxylate (P5C) reductase, which reduces P5C (the product expected
from dehydration of 4-hydroxyproline) to proline. Cluster
16 was characterized as a novel (S)-1,2-propanediol
dehydratase (a previously characterized analogue is an adenosylcobalamin-dependent
enzyme); the identity of the substrate was suggested from genome analysis
because the enzyme is found in Roseburia inulinivorans, which catabolizes l-fucose but lacks the adenosylcobalamin-dependent
dehydratase.

A “user friendly” web tool is not
yet available to
allow the community to use “chemically guided functional profiling”
with their favorite families. However, the development of a web tool is
a high priority goal given its ability to identify important targets
for functional characterization.
AGeNNT and Refined GNNs: Building on EFI-GNT
EFI-GNT
provides GNNs in two formats that summarize (1) the Pfam families
identified by each SSN cluster (edges between SSN cluster hub-nodes
and Pfam family spoke-nodes), providing information about the reactions
in metabolic pathways, and (2) the SSN clusters that identify each
Pfam family (edges between Pfam family hub-nodes and SSN cluster spoke-nodes),
providing information about whether multiple clusters may contain
orthologues.

Merkl and co-workers recently described AGeNNT
(Automatically Generates refined Neighborhood NeTworks), a Java application
that uses the GNNs provided by EFI-GNT to generate a third format
(“refined GNN”) in which all of the SSN cluster and
Pfam family nodes are connected by edges.[71] Clusters that contain orthologues, identified when they share the
same genome neighbors, can be distinguished from clusters that have
different genome contexts. An SSN is submitted to the EFI-GNT web
tool. AGeNNT then generates the refined GNN. Several options are provided,
including (1) eliminating overrepresented phylogenetically related
subspecies from the input SSN to reduce redundancy in the GNN and
(2) using a user-defined “whitelist” of Pfam families
to include in the refined GNN. For example, only Pfam families for
enzymes can be included in the refined GNN so Pfam cluster connections
between SSN clusters that involve transporters and transcriptional
regulators are eliminated (in contrast to pathway enzymes, transporters
and transcriptional regulators are not conserved).

Continuing
again with the proline racemase family (PF05544) to
provide an example, several major clusters from the SSN were selected
for generation of GNNs using EFI-GNT and the refined GNN using AGeNNT
(Figure 10). The colored
SSN is shown in Figure 10A, the SSN cluster hub-node GNN format is shown in Figure 10B, the Pfam family
hub-node GNN format is shown in Figure 10C, and the refined GNN is shown in Figure 10D (Pfam families
for transport systems and transcriptional regulators are deleted in
the GNNs; because these families are not conserved in pathways (vide supra), their inclusion in the refined GNN can complicate
the analysis). Comparison of the refined GNN with the GNNs establishes
the utility of the refined GNN in identifying orthologous SSN clusters:
clusters 2, 4, 5, and 6 are orthologous 4-hydroxyproline epimerases;
clusters 1 and 3 are orthologous trans-3-hydroxyproline
dehydratases; and cluster 7 is proline racemase (using functional
assignments based on experimental verification[97]). Building on EFI-EST and EFI-GNT, AGeNNT links SSN clusters
that share pathway context, potentially identifying interrelations
of subfamilies within a protein family.
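The idea behind the refined GNN (linking SSN clusters that share Pfam neighbors, optionally after restricting the families to a whitelist) can be sketched as follows. This is not AGeNNT's algorithm, which is a Java application with its own options; the function name `refined_groups`, the cluster labels, and the family labels (`famA`, `transporter`, and so on) are hypothetical placeholders.

```python
def refined_groups(cluster_to_pfams, whitelist=None):
    """Group SSN clusters that are linked through shared Pfam neighbors.

    cluster_to_pfams: {ssn_cluster: set of neighbor Pfam families}
    whitelist: optional set of families to keep (e.g., enzymes only)
    Returns a sorted list of sorted cluster lists.
    """
    groups = []  # list of (clusters, pfams); pfam sets stay pairwise disjoint
    for cluster, pfams in cluster_to_pfams.items():
        kept = {p for p in pfams if whitelist is None or p in whitelist}
        merged_clusters, merged_pfams = {cluster}, set(kept)
        remaining = []
        for g_clusters, g_pfams in groups:
            if g_pfams & merged_pfams:
                # Shares at least one family with the new cluster: merge.
                merged_clusters |= g_clusters
                merged_pfams |= g_pfams
            else:
                remaining.append((g_clusters, g_pfams))
        groups = remaining + [(merged_clusters, merged_pfams)]
    return sorted(sorted(clusters) for clusters, _ in groups)


# Placeholder data: a shared transporter family links unrelated clusters.
toy = {
    "1": {"famA", "transporter"},
    "2": {"famB", "famC", "transporter"},
    "3": {"famA"},
    "4": {"famB"},
    "5": {"famB", "famC"},
    "6": {"famC"},
    "7": {"famD", "transporter"},
}
enzymes_only = {"famA", "famB", "famC", "famD"}

unfiltered = refined_groups(toy)
filtered = refined_groups(toy, whitelist=enzymes_only)
```

Without the whitelist, the transporter family connects all seven placeholder clusters into one group; with the enzyme-only whitelist, three groups emerge. This illustrates why removing transporter and regulator families can make it easier to see which SSN clusters share a pathway context.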
Figure 10
(A) Colored SSN generated
by EFI-GNT for selected clusters in the
proline racemase family (PF05544). (B) GNN with SSN cluster hub-nodes
and Pfam family spoke-nodes. (C) GNN with Pfam family hub-nodes and
SSN cluster spoke-nodes. (D) Refined GNN showing identification of
three different functions as deduced by connections (or lack thereof)
between SSN cluster and Pfam family nodes.
Future Directions
EFI-EST and EFI-GNT provide experimentalists
with otherwise inaccessible but essential perspectives on sequence–function
space in protein families and genome context that facilitate the assignment
of functions to uncharacterized enzymes. Other web tools are available
for smaller scale analysis of protein families, but genomic enzymology
“requires” large-scale analyses to provide the maximum
amount of context.

Other large-scale web tools can be imagined.
For example, the proteome of an organism (or of a community) determines
its metabolic capabilities; therefore, an easy-to-construct overview
of the metabolic potential would be useful and could be provided by
a “proteome network” (PN) tool. A PN would include a
node for each protein encoded by a genome (or community) and collected
into Pfam family clusters (Pfam family hub-node and protein spoke
nodes). The PN would identify the catalytic capabilities via the identities
of the Pfam families and, also, the locations of the proteins (spoke
nodes) in the SSNs for their families. For a community PN, identification
of species-specific Pfam families could provide the potential to identify
syntrophic metabolic pathways, e.g., different organisms contribute
different metabolic capabilities to synthesize a natural product or
degrade an energy source. In analogy with chemically guided functional
profiling, mapping transcriptome abundance to the PN would provide
a visually powerful approach for identifying enzymes in novel pathways.

Also, the Pfam families that contribute enzymes to a pathway often
are conserved in phylogenetically diverse organisms; however, we have
observed that one or more reactions in a metabolic pathway can be
catalyzed by analogues (nonorthologous gene replacements) in different
taxonomic ranks, e.g., phylum, class, order, or family. The ability
to discover analogues may be enhanced by clustering members of a protein
family by taxonomic rank instead of pairwise sequence identity (SSNs).
Because the node attributes that are provided by EFI-EST for sequences
include taxonomic ranking, a taxonomic rank network (“TRN”)
would be easy to construct. Subsequent generation of sequence similarity-based
SSNs for individual clusters in the TRN would be accomplished with
Option D of EFI-EST, thereby providing the ability to further segregate
and analyze the clusters by sequence homology.

Finally, although
the generation of an SSN is straightforward,
Release 31.0 of the Pfam database defines 16 712
families. Immediate access to a library of precomputed SSNs for all
Pfam families would provide the biological and biomedical communities,
including users of web tools that identify BGCs (vide supra), with the ability to quickly place their favorite enzymes in the
context of sequence–function relationships for their protein families.
This library of SSNs should be regularly updated to provide current
information (perhaps in parallel with releases of the InterPro database),
but its construction requires considerable computational resources.
We have demonstrated that the calculation of this database is feasible,
although we have not yet been able to initiate the production phase
of this effort.

I encourage readers to (1) try the EFI-EST
and EFI-GNT web
tools, (2) imagine new applications for SSNs and GNNs, and (3) identify
additional large-scale data visualization and analysis challenges
that would be amenable to solution by community-accessible web tools.
Like the natural products community, the enzymology community needs
to recognize the essential role of web tools that allow the protein
and genome sequence databases to be leveraged for the solution of
biological problems.