Brian F Fisher1, Harrison M Snodgrass1, Krysten A Jones2, Mary C Andorfer3, Jared C Lewis1. 1. Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States. 2. Department of Chemistry, University of Chicago, Chicago, Illinois 60637, United States. 3. Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
Abstract
Enzymes are powerful catalysts for site-selective C-H bond functionalization. Identifying suitable enzymes for this task and for biocatalysis in general remains challenging, however, due to the fundamental difficulty of predicting catalytic activity from sequence information. In this study, family-wide activity profiling was used to obtain sequence-function information on flavin-dependent halogenases (FDHs). This broad survey provided a number of insights into FDH activity, including halide specificity and substrate preference, that were not apparent from the more focused studies reported to date. Regions of FDH sequence space that are most likely to contain enzymes suitable for halogenating small-molecule substrates were also identified. FDHs with novel substrate scope and complementary regioselectivity on large, three-dimensionally complex compounds were characterized and used for preparative-scale late-stage C-H functionalization. In many cases, these enzymes provide activities that required several rounds of directed evolution to accomplish in previous efforts, highlighting that this approach can achieve significant time savings for biocatalyst identification and provide advanced starting points for further evolution.
Enzymes can be powerful
tools for the synthesis of fine chemicals,
pharmaceuticals,[1] agrochemicals, and many
other materials.[2] On the other hand, the
very features that give rise to the selectivity and catalytic proficiency
of enzymes acting on their native substrates often lead to high substrate
specificity and thus poor activity on non-native substrates. The dearth
of enzymes available for reactions of interest can therefore be a
major impediment to implementing enzymes in synthetic routes. Many
of the enzymes commonly used today (e.g., ketoreductases, transaminases,
cytochromes P450, etc.) were originally identified via arduous enzymology
aimed at clarifying their native biological activities. Often, many
rounds of directed evolution were also required to optimize these
enzymes for synthetic applications. Expanding the number of known
biocatalysts, with an emphasis on exploring broad sequence diversity
within an enzyme family, could therefore greatly facilitate the use
of enzymes in chemical synthesis.[3,4]

Numerous methods have been used to explore the functional diversity
of naturally occurring enzymes in discrete genomes, metagenomic samples,
and sequence databases.[5] Advances in DNA
sequencing—in particular, metagenome sequencing—have
resulted in an explosion of the size of protein sequence databases.[6] Coupled with the decreasing cost of gene synthesis,[7] mining these sequence databases for potential
biocatalysts is becoming increasingly accessible to scientists.[8] Such approaches are most commonly used to identify
enzymes that act on a substrate of interest, often a chromogenic probe
compound chosen more for ease of screening than for synthetic utility,[9,10] but efforts to profile the activity of entire enzyme families on
a range of substrates and identify biocatalysts with a collectively
broad substrate scope are less common. Family-wide profiling efforts
include investigations on phosphatases,[11] metallo-β-lactamases,[9,12] and glutathione-S-transferases.[13] Studies on dehalogenases,[14] esterases,[15] glycosyl transferases,[16] and imine reductases[17] highlight the potential synthetic utility of enzymes identified
from such efforts. Comparable genome mining efforts on enzymes that
functionalize C–H bonds have not been reported.

Flavin-dependent halogenases (FDHs), which catalyze site-selective
C–H halogenation of electron-rich aromatic compounds, have
been studied extensively due to their potential synthetic utility.[18,19] Late-stage functionalization,[20] sequential
halogenation/cross-coupling,[21−23] and preparative-scale halogenation[24] have all been accomplished using these enzymes.
Our efforts have focused primarily on RebH, an FDH that was identified
in studies aimed at elucidating the biosynthetic pathway of the antitumor
compound rebeccamycin.[25] In this context,
RebH catalyzes site-selective chlorination of tryptophan, and it has
since been shown to halogenate a range of indoles and anilines.[26] Our group has also shown that directed evolution
can be used to create RebH variants with improved thermal stability,[27] high activity on large, biologically active
compounds,[20] and high selectivity for different
sites on target compounds.[28] While effective,
these efforts required 3–10 rounds of directed evolution due
to the wild-type enzyme’s modest stability, low activity on
large substrates, and high regioselectivity.

While additional FDHs could therefore expand the utility of these
enzymes for synthesis, only a relatively narrow set of FDHs have been
investigated for biocatalysis.[19] FDHs that
catalyze tryptophan chlorination, such as RebH, Thal,[29] SttH,[30] and PrnA,[31] in particular are over-represented. Fungal halogenases,
such as Rdc2,[32] RadH,[33] and GsfI,[34] which natively chlorinate
phenol-containing substrates, have also been shown to be active and
selective biocatalysts. Literature reports on the collective substrate
scopes of the FDHs reported to date suggest that they prefer chloride
over other halides and that they act on electron-rich compounds similar
to their native substrate.[19] On the other
hand, the existence of a range of complex halogenated natural products
distinct from those produced by well-characterized biosynthetic gene
clusters[35] implies that FDHs with unique
substrate scopes might be found in less characterized halogenase subgroups.[36] We hypothesized that exploring uncharacterized
FDHs found in protein sequence databases could, together with currently
characterized enzymes, form a diverse starting toolkit for selective
late-stage C–H halogenation.

Herein, we describe the use of a high-throughput mass-spectrometry-based
screen to evaluate a broad set of over 100 putative FDH sequences
drawn from throughout the FDH family. Halogenases with novel substrate
scope and complementary regioselectivity on large, three-dimensionally
complex compounds were identified. This effort involved far more extensive
sequence–function analysis than has been accomplished using
the relatively narrow range of FDHs characterized to date, providing
a clearer picture of the regions in FDH sequence space that are most
likely to contain enzymes suitable for halogenating small-molecule
substrates. The representative enzyme panel constructed in this study
also provides a rapid means to identify FDHs for lead diversification
via late-stage C–H functionalization. In many cases, these
enzymes provide activities that required several rounds of directed
evolution to accomplish in previous efforts, highlighting that this
approach can achieve significant time savings for biocatalyst identification
and provide advanced starting points for further evolution.
Results
Organization of Halogenase Sequence-Similarity Network
A BLAST search of the UniProt sequence database using RebH as a query
sequence and an E-value threshold of 10⁻⁵ generated 3975 unique hits spanning a range of sequence and host
diversity, including bacterial, archaeal, eukaryotic, and viral proteins.
Nearly all (>90%) previously reported FDHs are present in this
set.
The dinucleotide-binding GxGxxG motif, characteristic of FAD-binding
proteins, is present in 92% of the sequences, and the WxWxIP motif,[37] characteristic of FDHs but absent in flavin-dependent
monooxygenases, is found in 69% of the sequences. The latter value
increases to 78% when motif variants WxWxI[R,G][38] are included. Collectively, these analyses suggest that
the majority of the sequences examined are likely FDHs.

Sequence-similarity networks (SSNs)[39] were then used to visualize
functional relationships among putative FDH sequences. In this representation,
protein sequences are illustrated as nodes in a network graph that
are connected by edges (lines) to other sequences that exceed a specified
pairwise sequence similarity. An SSN was generated for the entire
FDH sequence set with a permissive edge detection threshold (corresponding
to ≈30% sequence identity) using the Enzyme Function Initiative-Enzyme
Similarity Tool (EFI-EST).[40,41] Previously reported
data for 129 known enzymes found among the BLAST hits were mapped
onto this Level 1 SSN to explore subnetwork colocalization of enzyme
properties. The clearest defining features of the individual subnetworks
are host domain and compound class—indole, phenol, or pyrrole—of
native substrates for known FDHs within the subnetworks (Figure 1A), the latter suggesting
that the SSN might provide a framework for identifying enzymes that
act on specific compound classes and for surveying regions of sequence
space where substrate preference is unknown.
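The motif check described earlier in this section (GxGxxG, WxWxIP, and the WxWxI[R,G] variants) can be illustrated with a short sketch. This is not the analysis pipeline used in the study; it simply assumes that each motif is treated as a pattern in which "x" matches any residue and that a sequence counts once if the motif occurs anywhere in it. The toy sequences are invented for illustration.

```python
import re

# Motifs written as regular expressions; "x" in the text means "any residue".
MOTIFS = {
    "GxGxxG": re.compile(r"G.G..G"),          # dinucleotide (FAD) binding
    "WxWxIP": re.compile(r"W.W.IP"),          # characteristic of FDHs
    "WxWxI[P,R,G]": re.compile(r"W.W.I[PRG]"),  # WxWxIP plus R/G variants
}

def motif_fractions(sequences):
    """Return the fraction of sequences containing each motif at least once."""
    sequences = list(sequences)
    total = len(sequences)
    return {
        name: sum(1 for seq in sequences if pattern.search(seq)) / total
        for name, pattern in MOTIFS.items()
    }

# Hypothetical toy sequences (not real proteins), for illustration only.
toy_sequences = [
    "MKIVIVGGGTAGWMAAWSWAIPLLE",  # GxGxxG and WxWxIP
    "MRVAILGAGPSGLTSWEWHIRQ",     # GxGxxG and the WxWxIR variant
    "MSTLLPQRNNDE",               # neither motif
]

fractions = motif_fractions(toy_sequences)
for name, value in fractions.items():
    print(f"{name}: {value:.0%}")
```

Applied to the real hit set, the same counting would be expected to give values comparable to the 92%, 69%, and 78% quoted above, provided the motif definitions match those the authors used.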
Figure 1
(A) Sequence-similarity
network for flavin-dependent halogenases.
Each circle is a representative node, grouping protein sequences with
>50% sequence identity as determined by CD-HIT.[65] Edge detection threshold set at alignment score of 70 (≈30%
sequence identity). Nodes are filled according to native substrate
functional group of at least one sequence in the representative node;
colored stroke indicates domain (thin black stroke = bacterial). Subnetworks
with ≥15 sequences but without any characterized protein are
labeled numerically. Level 2 subnetworks formed from the Indole (B)
and Phenol (C) subnetwork using a stricter alignment score cutoff
of 140 (≈40% sequence identity). Level 2 subnetworks are labeled
based on known sequences in the subnetwork. For Indole Subnetwork
sequences, nodes containing known tryptophan halogenases are filled
according to their regioselectivity, and subnetworks with ≥15
sequences are labeled numerically. For Phenol Subnetwork sequences,
nodes are filled according to the halogenase variant type (A = free
small molecule native substrate, B = ACP-tethered native substrate).
The largest subnetwork, comprising 2270 sequences, contains FDHs
that either natively halogenate tryptophan or have been shown to catalyze
indole halogenation in vitro. All known tryptophan
FDHs are found in this Indole Subnetwork, including tryptophan 5-,
6-, and 7-halogenases PyrH,[42] SttH,[43] and RebH.[44] BrvH,
a halogenase identified from metagenomic analysis,[45] and three recently reported halogenases from Xanthomonas
campestris[46] are also in this
subnetwork. Although the native substrates of these enzymes are not
known, they have been shown to halogenate a variety of small indoles.
A protein whose structure has been determined as part of structural
genomics efforts (PDB: 2PYX)[47] is also present in this
subnetwork, although its native activity is also unknown.The
second largest subnetwork comprises 438 sequences from bacteria
and fungi and includes most known phenol FDHs that, collectively,
halogenate a diverse range of phenol-containing substrates. For example,
the bacterial halogenase TiaM chlorinates a large macrocyclic intermediate
in the biosynthesis of tiacumicin B.[48] Bacterial
halogenases VhaA and Tcp21 chlorinate PCP-tethered amino acids in
the biosynthesis of the NRPS glycopeptide antibiotics vancomycin and
teicoplanin, respectively.[49] The putative
bacterial iodinase CalO3 is also present in the Phenol Subnetwork,[50] showcasing that substrate diversity also extends
to halide specificity. All fungal phenol FDHs that have been studied
as biocatalysts on diverse substrates, including Rdc2,[32] RadH,[33] and GsfI,[34] are also contained in this subnetwork.

The third largest subnetwork, with 212 sequences, contains FDHs
that are involved in chlorinated pyrrole natural product biosynthesis.
The six pyrrole halogenases in the Pyrrole Subnetwork have an average
pairwise identity of 87%, and all are annotated in UniProt as PrnC,
which halogenates a pyrrole small-molecule intermediate in pyrrolnitrin
biosynthesis.[51] The halogenase PltM natively
halogenates a phenolic substrate, phloroglucinol, to produce chlorinated
compounds that induce biosynthesis of a pyrrole-containing natural
product, pyoluteorin.[52] Two proteins, Dox16
and Dox17, potentially halogenate phenolic moieties during the biosynthesis
of pyrrolomycins,[53] pyrrole-containing
compounds structurally similar to pyrrolnitrin. These observations
suggest that the common ancestor to enzymes in this subnetwork diverged
in substrate specificity to yield halogenases specialized for distinct
roles in chlorinated pyrrole natural product biosynthesis (Figure S18).

Most of the smaller subnetworks contain only uncharacterized proteins
(and are therefore simply assigned a number for reference in Figures 1 or 2), but several include known FDHs with diverse native substrates.
One small subnetwork contains several enzymes, including MibH,[54] MscL,[55] and KrmI,[56] that natively chlorinate peptidyl tryptophan
side chains in macrocyclic lanthipeptide and NRPS natural products.
Other subnetworks include MalA and MalA′, which are responsible
for iterative chlorination in the biosynthesis of malbrancheamide,[38] ChlA, which chlorinates a phenol in DIF-1 biosynthesis,[57] and GetL, an enzyme suspected to be responsible
for chlorinating PCP-tethered histidine in the biosynthesis of tetrapeptide
antibiotics.[58] Halogenases responsible
for chlorinating ACP-tethered pyrroles (variant B pyrrole halogenases)
such as PltA[59,60] and Mpy16[61] occupied a subnetwork distinct from the larger subnetwork
that included the variant A pyrrole halogenase PrnC. A few subnetworks
contain enzymes that are not FDHs, including the flavin-dependent
monooxygenases Qhpg[62] and LodB[63] and putative geranylgeranyl reductases.[64]
Figure 2
(A) Sequence-similarity network for flavin-dependent halogenases,
drawn at the less stringent edge detection threshold (≈30%
identity), colored according to subnetwork. Subnetworks within the
Indole and Phenol Subnetworks at the more stringent threshold are
colored differently. Subnetworks with fewer than 15 members are colored
white; subnetworks without sequences of known or inferred function
are colored light gray. (B) Treemap illustrating the SSN with the
same coloring as part A. (C) Treemap comparing FDHs previously studied
as biocatalysts with FDHs investigated in this study. (D) Treemap
illustrating solubility of genome-mined enzymes in each subnetwork
of the SSN. Color gradient represents the fraction of enzymes within
the subnetwork that was soluble; diagonal bars indicate subnetworks
wherein no enzyme was tested. Treemaps illustrating the fraction of
enzymes in each subnetwork that were capable of chlorinating (E) or
brominating (F) at least one substrate in the high-throughput screen
(8% conversion threshold).
Subnetworks in SSNs can be explored in greater detail by increasing
the stringency of the sequence similarity required for edge detection.[39] The SSN drawn with ≈30% identity cutoff
for edge detection (Level 1) was examined with the identity cutoff
increased to ≈40% (Level 2). Functionally distinct subnetworks
within the Level 1 Indole Subnetwork became evident in the Level 2
SSN. All known tryptophan halogenases localized to only two relatively
small Level 2 subnetworks distinguished by their regioselectivity
(Figure 1B). All tryptophan
5-halogenases, such as PyrH, localized into one of these, and all
tryptophan 7-halogenases, including RebH and PrnA, were found in the
other subnetwork. Interestingly, tryptophan 6-halogenases were found
roughly evenly distributed between these two subnetworks. Only two
reports describe the substrate scopes of FDHs within the largest Level
2 subnetwork in the Indole Subnetwork, which demonstrated that some
enzymes in this subnetwork prefer bromination to chlorination.[45,46] The second largest subnetwork contained the sequence of the structurally
characterized but functionally uncharacterized protein 2PYX. Overall,
the sparse evaluation of enzymes within the Indole Subnetwork highlights
the fact that, even among proteins that are most similar to the well-characterized
tryptophan halogenases, there remains a vast amount of sequence space
to be explored.

Closer inspection of the Level 1 Phenol Subnetwork at the stricter
Level 2 identity cutoff shows subnetworks separated on the basis of
domain and whether the FDH natively halogenates a free small molecule
(variant A) or an acyl carrier protein-tethered small molecule (variant
B) (Figure 1C). The
largest subnetwork is composed entirely of eukaryotic sequences, and
all experimentally characterized proteins within the subnetwork, such
as Rdc2, are variant A halogenases. The second-largest subnetwork
in this group contains only bacterial sequences, many of which, including
VhaA,[66] are variant B halogenases that
catalyze chlorination in glycopeptide antibiotic biosynthesis.[49]
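The relationship between the Level 1 and Level 2 networks can be sketched as follows. This is a minimal illustration, not the EFI-EST implementation: it assumes that a subnetwork is simply a connected component of the graph that remains after discarding edges below the alignment-score cutoff (70 for Level 1 and 140 for Level 2, as given in the Figure 1 caption). The node names and pairwise scores are hypothetical.

```python
def subnetworks(nodes, scores, threshold):
    """Group nodes into connected components, keeping only edges whose
    alignment score is at least `threshold` (union-find with path halving)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    # Largest subnetwork first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))

# Hypothetical representative nodes and pairwise alignment scores.
nodes = ["A", "B", "C", "D", "E"]
scores = {("A", "B"): 160, ("B", "C"): 90, ("C", "D"): 150, ("D", "E"): 40}

level1 = subnetworks(nodes, scores, threshold=70)   # permissive cutoff
level2 = subnetworks(nodes, scores, threshold=140)  # stricter cutoff
print(level1)  # [['A', 'B', 'C', 'D'], ['E']]
print(level2)  # [['A', 'B'], ['C', 'D'], ['E']]
```

In this toy case the single large Level 1 subnetwork splits into two Level 2 subnetworks once the weaker B–C edge is removed, which mirrors how functionally distinct groups emerge from the Indole and Phenol Subnetworks at the stricter cutoff.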
Expression of Genome-Mined Halogenases
The sequence-similarity
network outlined above was used as a framework to guide the selection
of a diverse set of novel FDHs from each subnetwork. The Phenol Subnetwork
was oversampled due to the high structural diversity of substrates
natively halogenated by known enzymes in this subnetwork. Other sequences
were sampled evenly from the rest of the SSN. Transcriptomic data
for sequences from fungi were analyzed using the JGI Mycocosm database
to prioritize the synthesis of sequences in order of sequence model
quality (see the SI).[87] Figure 2 depicts the SSN and treemap representations summarizing the distribution
of different properties of enzymes in the different subnetworks.

A total of 128 putative halogenase sequences and RebH as a control
were codon-optimized and coexpressed in pET28 with chaperones from
the plasmid pGro7 in Escherichia coli BL21(DE3) under
conditions found to be successful in expression of bacterial as well
as fungal halogenases.[34] A total of 87
new enzymes were obtained in sufficient soluble concentration for
functional characterization, but attempts to improve parallel expression
of the remaining enzymes did not lead to significant improvements
(Figure S55). Halogenases from throughout
the entire SSN could be expressed with good titers, but solubility
was not evenly distributed (Figure 2D). While 68% of enzymes were soluble, the Indole Subnetwork
provided a significantly higher fraction of soluble enzymes compared
to others (91%, 42 total). The halogenases in the Phenol Subnetwork
had much lower solubility (49% overall, 17 total), which was not significantly
influenced by the domain of the source organism (45% soluble for eukaryotic,
and 50% soluble for bacterial genes). The fraction of soluble Pyrrole
Subnetwork halogenases was close to the overall average (71%, 5 total), while several
small subnetworks that were sampled provided no soluble halogenases
under the expression conditions tested.
Probe Substrate High-Throughput Screen
The set of 87
diverse, soluble FDHs was subjected to a high-throughput activity
screen to evaluate which enzymes had detectable activity and, for
active enzymes, to develop substrate activity profiles to better understand
whether activity and subnetwork membership were related. For initial
activity screens, a set of 12 probe substrates—4 indoles, 4
anilines, and 4 phenols—was selected from among the substrates
previously found by our group to be reactive under FDH chlorination
conditions (Figure 3A). The key hypothesis governing selection of these substrates was
that their high inherent reactivity, reflected in their high calculated
halenium affinity values,[34,67] would lead to detectable
reactivity with active enzymes even if they exhibited poor binding
within FDH active sites. Structural variation within the panel was
used to facilitate the identification of viable substrates,[68] and substrates with multiple potentially reactive
sites were prioritized to increase the probability that reactive binding
poses could be achieved. Initial screens evaluated both chlorination
and bromination activities, the two most common halogenation reactions
catalyzed by FDHs. The probe substrate screens required at least 2088
independent experiments, not including replicates or controls. This
heavy screening requirement prompted us to adopt a high-throughput
LC-MS-based screen (Figure 3B), which also required that viable substrates ionize well
by ESI.[69−71] Using this method, analysis throughput of up to ≈11
s per reaction was achieved, and ultimately ≈20 000
experiments were analyzed.
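The screening load quoted above follows directly from the size of the enzyme and substrate panels. The short calculation below reproduces the 2088 figure and, as an added back-of-the-envelope estimate that does not appear in the study, converts the quoted best-case analysis rate into instrument time. The estimate assumes that every reaction is analyzed at the ≈11 s rate with no overhead, so it should be read as a lower bound.

```python
# Values taken from the text.
n_enzymes = 87       # soluble FDHs screened
n_substrates = 12    # probe substrates (4 indoles, 4 anilines, 4 phenols)
n_halides = 2        # chloride and bromide

# Minimum number of independent reactions, excluding replicates and controls.
min_experiments = n_enzymes * n_substrates * n_halides
print(min_experiments)  # 2088

# Assumption: best-case analysis rate applied to every experiment.
seconds_per_reaction = 11
total_analyzed = 20_000  # approximate number of experiments analyzed

instrument_hours = total_analyzed * seconds_per_reaction / 3600
print(f"{instrument_hours:.1f} h")  # 61.1 h
```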
Figure 3
(A) Probe substrates included in initial high-throughput
screen.
(B) Scheme summarizing LC-MS-based high-throughput screening method
employed.
A total of 39 new halogenases (45% of soluble enzymes) were able
to halogenate at least one of the probe substrates. Halogenation of
nearly the entire probe substrate panel was achieved by the genome-mined
set of enzymes; only formoterol was not halogenated by at least one
new halogenase. Overall, bromination activity was more prevalent than
chlorination activity. All genome-mined enzymes that were active had
brominase activity, but only 16% of the enzyme set had detectable
chlorinase activity. Activity was unevenly distributed across the
SSN; certain subnetworks had a higher abundance of active enzymes than others
(Figure 2E–F).
The Indole Subnetwork had a particularly high percentage of active
enzymes; of the 42 Indole Subnetwork enzymes screened, 27 (64%) were
active. The fraction of active enzymes was similar for bacterial and
eukaryotic proteins, with 48% of bacterial and 56% of eukaryotic enzymes
screened having some activity on probe substrates. One of the three
viral proteins tested was active, and none of the six archaeal proteins
were active.

The high-throughput screening conversion data for each reaction
were plotted as a heatmap, and hierarchical clustering analysis was
used to characterize, separately, the similarity of activity profiles
for substrates and for FDHs (Figure 4). Substrates tended to form clusters based on their
compound class, consistent with the observed similarity of “enzyme-scope”
of substrates within the same substrate class.[34] All phenols were present in two substrate clusters, one
containing only chlorination reactions and the other containing only
bromination reactions. Anilines and indoles were more mixed into the
remaining two clusters, but still distinguishable. One of these clusters
primarily included indole chlorination, dominated by the high indole
chlorination activity of RebH and a highly similar enzyme, 1-B12.
The other contained mostly aniline bromination reactions, the high
activity for which was more broadly distributed.
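The clustering step can be sketched in a few lines. The distance metric and linkage method used in the study are not stated here, so the example below assumes Euclidean distance with average linkage, and it applies the 8% conversion cutoff from the Figure 4 caption before clustering. The enzyme names and conversion values are hypothetical.

```python
def apply_threshold(profile, cutoff=8.0):
    """Set conversions at or below the cutoff (percent) to zero."""
    return [v if v > cutoff else 0.0 for v in profile]

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance until n_clusters remain."""
    clusters = [[name] for name in sorted(profiles)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [euclidean(profiles[a], profiles[b])
                         for a in clusters[i] for b in clusters[j]]
                mean = sum(dists) / len(dists)
                if best is None or mean < best[0]:
                    best = (mean, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)

# Hypothetical percent conversions; columns are
# (indole/Cl, indole/Br, phenol/Cl, phenol/Br).
raw = {
    "enz_1": [80, 75, 5, 3],
    "enz_2": [70, 82, 0, 6],
    "enz_3": [2, 10, 40, 55],
    "enz_4": [0, 7, 35, 60],
}
profiles = {name: apply_threshold(values) for name, values in raw.items()}
activity_clusters = average_linkage(profiles, n_clusters=2)
print(activity_clusters)  # [['enz_1', 'enz_2'], ['enz_3', 'enz_4']]
```

Clustering the substrate/halide columns instead of the enzyme rows only requires transposing the table before calling the same function.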
Figure 4
Heatmap of high-throughput
screening results, with hierarchical
clustering dendrograms for substrate/halide activity similarity and
enzyme activity similarity at the top and left, respectively. Substrate
functional groups and halide used in the reaction are color-coded
with bars at the tips of the dendrograms. Only reactions with >8%
conversion are shown, a value selected to remove false positives (see the SI).
Most importantly, enzymes in the same subnetwork tended to cluster
together based on their activity profiles. Four activity clusters
of enzymes (AC1–4) can be distinguished from the probe substrate
high-throughput screening data. AC1, at the top of the heatmap, contained
almost exclusively halogenases in the Indole Subnetwork. None of the
Indole Subnetwork enzymes in this AC were in either of the two Level
2 tryptophan subnetworks, however, and they were distinguished by
their preference for bromination of phenols and anilines. Despite
the fact that indoles are the most common substrates known to be halogenated
by enzymes in the Indole Subnetwork, halogenase activity on indoles
in AC1 was limited. Pindolol was the only indole halogenated by more
than one enzyme, and only a single FDH, 1-F08 (34% identical to SttH),
chlorinated more than one indole.

Activity cluster 2 (AC2) had similar bromination scope to AC1 but
had higher breadth of phenol chlorination activity. Only two of the
nine enzymes in this activity cluster were present in the Indole Subnetwork,
whereas four were in the Phenol Subnetwork. Halogenase 1-F11, from
an unannotated subnetwork within the Indole Subnetwork (38% identical
to tryptophan-5 halogenase ClaH[72]), and
2-C01, a halogenase in the same subnetwork as the lanthipeptide indole
halogenase MibH (36% identical[54]), were
capable of chlorinating multiple phenols, 2,4-dihydroxyacetophenone
(2,4-DHAP) and 7-hydroxycoumarin. The FDH 2-C01 was particularly versatile
in halide scope. UC-066, 7-hydroxycoumarin, and 2,4-DHAP were chlorinated
and brominated by 2-C01 with similar yields. Enzyme 1-F05, from the
Phenol Subnetwork (49% identical to ArmH4[73]), was similarly versatile in the halides it accepted, but its activity
was specific for phenolic probe substrates. It had the broadest phenol
substrate scope of any enzyme tested, but did not halogenate any aniline
or indole.

Activity cluster 3 (AC3) was small and populated by low-activity
enzymes only having bromination activity on the substrates that were
most easily halogenated. AC4 contained only two enzymes, RebH and
1-B12, that had the broadest substrate scope, particularly on indole
probe substrates. The high probe substrate scope of RebH was expected
by design, since the indoles and anilines of the probe panel were
assembled from substrates that were known to be chlorinated by RebH.
The enzyme 1-B12 has high sequence similarity to RebH (64% identical)
and a strongly similar substrate activity profile.
Activity and Selectivity of Mined Halogenases toward Complex Substrates
Based on the remarkable activity that our genome-mined
halogenases exhibited toward probe substrates, we next wondered whether
they might be capable of halogenating substrates that were not selected
from a set of easily halogenated compounds. A total of 50 larger and
more three-dimensionally complex additional substrates were selected
for these activity studies (Figure 5A). Among the compounds in this expanded substrate
set were yohimbine, a compound for which we previously evolved halogenase
activity from RebH,[20] and premalbrancheamide,
a compound natively halogenated by the FDH MalA.[38] Most of the substrates have not been reported as FDH substrates
previously, including β-estradiol 17-(β-d-glucuronide),
an estrogen metabolite, and cabergoline, an ergot alkaloid. A total
of 48% of the more complex substrates tested were halogenated by at
least one halogenase under the nonoptimized conditions used in the
high-throughput screen (Figure 5B). Hierarchical clustering was performed on the reaction
data as above. However, the similarities between enzymes were substantially
lower than in the clustering analysis of the probe substrate data,
and activity clusters were consequently less defined.
Figure 5
(A) Representative compounds
included in expanded high-throughput
substrate screen, each of which was halogenated by at least one genome-mined
FDH. (B) Heatmap of expanded substrate screen data with 10 of the
most active enzymes from the probe high-throughput substrate screen.
Larger quantities of several of the most active
genome-mined FDHs were expressed, purified, and used for preparative-scale
bioconversions on a subset of the larger substrates evaluated (Figure 6). Premalbrancheamide
is a compound natively dichlorinated by MalA at C5 and C6 and has
been shown to be halogenated at the C5 or C6 positions nonselectively
using either chloride or bromide as halide sources.[38] The Indole Subnetwork FDH 1-F08 preferentially brominates
premalbrancheamide at C5 in 51% isolated yield, and it also brominates
AZ20, a selective ATR kinase inhibitor, in 28% isolated yield at the
indole C3 position. A different Indole Subnetwork enzyme, 1-F11, was
capable of brominating β-estradiol 17-(β-d-glucuronide),
an estradiol metabolite, at the 4-position of the steroid in 57% yield
and the 1-position of the carvedilol carbazole ring in 56% isolated
yield.
Figure 6
Preparative-scale bioconversions of larger substrates.
Many examples of reactions in which different regioisomers
were
formed by different enzymes were also identified (Figure 7). Pindolol, which is brominated
at C7 by RebH variants,[21] is also brominated
at C7 by the Indole Subnetwork FDH 1-F11. The MibH subnetwork enzyme
2-C01, on the other hand, preferentially brominates at C2. This finding
is notable since C2 is less electronically activated than C7 based
on its 3 kcal/mol lower halenium affinity (HalA), a metric for computationally
evaluating the reactivity of different positions of a molecule toward
EAS.[67] Naringenin is brominated at two
different positions using 1-F11 or Phenol Subnetwork enzyme 1-F05.
Despite the negligible energetic differences in HalA for C6 and C8
(0.7 kcal/mol), 1-F05 was found to be >95% selective for C8, while
1-F11 and other FDHs were found to have only minor preferences in
regioselectivity for C8 or C6. Trp-6,7 halogenase subnetwork enzyme
1-B12 halogenates the indole-containing compound methylergonovine
at C7, which has a halenium affinity 4.2 kcal/mol lower than C2, the
most nucleophilic aromatic C–H site on this compound. The FDH
2-C01, on the other hand, brominates methylergonovine at C2.
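To put the halenium-affinity gaps quoted above in perspective, the sketch below converts an energy difference into a Boltzmann population ratio at room temperature. Treating a HalA difference as if it were a difference in activation free energy is a simplifying assumption made here for illustration only, not a claim from the study, and the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def boltzmann_ratio(delta_kcal_per_mol, temperature_k=298.15):
    """Ratio of favored to disfavored site, assuming the HalA gap acts
    like a free-energy gap (an illustrative assumption)."""
    return math.exp(delta_kcal_per_mol / (R_KCAL * temperature_k))


# HalA gaps quoted in the text above (kcal/mol).
gaps = [
    ("naringenin, C6 vs C8", 0.7),
    ("pindolol, C7 over C2", 3.0),
    ("methylergonovine, C2 over C7", 4.2),
]
for label, gap in gaps:
    ratio = boltzmann_ratio(gap)
    share = ratio / (1.0 + ratio)  # fraction at the electronically favored site
    print(f"{label}: ratio {ratio:.1f}:1, favored share {share:.1%}")
```

Under this assumption, a 0.7 kcal/mol gap corresponds to an intrinsic preference of only about 3:1, whereas gaps of 3 to 4 kcal/mol correspond to ratios well above 100:1. Halogenation at the less activated site is therefore consistent with substrate positioning in the active site outweighing the substrate's intrinsic electronic bias.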
Figure 7
Regiocomplementary
halogenation of large molecules.
Discussion
Family-Wide View of FDH Properties
Family-wide analysis
of FDHs revealed several notable trends that are not apparent from
prior studies. First, FDHs from diverse host organisms can be solubly
expressed without significant optimization of expression conditions.
Bacterial enzymes had the highest soluble expression success rate
(76%), while a lower fraction (40%) of eukaryotic enzymes were soluble.
Notably, however, the lower fraction of soluble eukaryotic FDHs reflects
the poorer solubility of halogenases in the Phenol Subnetwork regardless
of host organism domain. Nearly all (20/23) of the eukaryotic proteins
evaluated were within the Phenol Subnetwork. Within this subnetwork,
the soluble expression rate is generally low, but it is actually higher
for proteins from eukaryotes (54%) relative to bacteria (44%). This
finding indicates that eukaryotic FDHs can be readily expressed in E. coli and that genome mining efforts should be encouraged
to include enzymes from eukaryotic species.[74]
Second, halogenase activity was also evenly distributed between
enzymes from bacterial and eukaryotic organisms (48% and 56% active,
respectively). FDH activity was not observed for any archaeal proteins
evaluated, consistent with the strong possibility that most if not
all archaeal sequences in the SSN are geranylgeranyl reductases. Interestingly,
one viral FDH, a cyanophage auxiliary metabolic gene product,[75] was active, though its activity and substrate
scope were low (conversion of <35% on only three probe substrates
was observed). In general, the identification of such a high percentage
of active halogenases, despite the use of non-native substrates for
activity profiling and a lax homology requirement for evaluation (E-value threshold of 10^-5), suggests that
this family contains a large number of enzymes suitable for biocatalysis.
Third, bromination activity was much more widespread than chlorination
activity within the FDHs surveyed. The majority of the FDH biocatalysis
literature focuses on chlorination activity because most FDHs reported
to date are involved in the biosynthesis of chlorinated natural products.
Moore[76] has reported three flavin-dependent
brominases involved in the biosynthesis of brominated natural products,
but these are more distantly related to enzymes comprising the SSN
in the current study. These brominases have 17 ± 4% sequence
identity to enzymes in the SSN; for comparison, RebH exhibits 29 ±
10% sequence identity to our genome-mined enzymes. Sewald[45,46] reported flavin-dependent halogenases (contained in the Indole Subnetwork
of the FDH SSN) that prefer bromide over chloride when acting on the
(presumably) non-native substrate indole. While this observation was
taken to indicate specificity of these enzymes toward bromide, our
findings indicate that a preference for bromination is common in FDHs.
We suggest that the higher electrophilicity of bromine relative to
chlorine in heteroatom-X species,[77] such
as the proposed hypohalous acid or haloamine halogenating agents in
FDH catalysis, leads to more facile bromination. For example, the
native chlorinase RebH can brominate a greater range of non-native
substrates than it can chlorinate. Preference for bromination over
chlorination for non-native as well as native substrates is also observed
when both Cl⁻ and Br⁻ are present
in solution. In competition reactions including both NaCl and NaBr,
RebH prefers bromide over chloride for L-tryptophan, 1-phenylpiperazine,
pindolol, and 2,4-dihydroxyacetophenone halogenation.[64] It is therefore possible that enzymes with higher bromination
than chlorination scope in our high-throughput screen could nevertheless
natively catalyze chlorination reactions.
Analyzing FDH Activity
Using Sequence-Similarity Networks and
Activity Clustering
Sequence-similarity networks provide
an intuitive structure for exploring the protein sequence space of
enzyme families. The FDH SSN contains Level 1 subnetworks comprising
enzymes with similar native substrate preferences (indole vs phenol,
etc.). At a more stringent identity threshold cutoff, Level 2 subnetworks
with finer functional distinction are revealed. Within the Level 1
Phenol Subnetwork, for example, different Level 2 subnetworks containing
primarily either variant A or variant B halogenases, which natively
halogenate free small molecules or ACP-tethered substrates, respectively,
are observed. The ability to distinguish such enzyme subclasses based
on sequence alone is useful for focusing future genome mining efforts
since our data indicate that neither of the variant B phenol halogenases
examined was soluble. Information on site selectivity could
also be obtained directly from sequence information in some cases.
For example, within the Level 1 Indole Subnetwork, separation of tryptophan
5- and 7-halogenases into distinct Level 2 subnetworks was apparent,
though tryptophan 6-halogenases were roughly evenly distributed between
these subgroups.
Only six of the Level 1 subnetworks (Figure A) examined contained
enzymes with measurable chlorination or bromination activity on our
probe substrate set, but these subnetworks contained 78% of the FDHs
within the SSN. Specifically, enzymes in the Indole Subnetwork (66%),
the Phenol Subnetwork (42%), the Pyrrole Subnetwork (2/5), subnetwork
4 (2/2), subnetwork 8 (1/2), and the MibH subnetwork (1/1) were active.
These findings reflect the nature of the probe substrates chosen,
but given the range of substrates examined and the similarity of these
substrates to pharmaceuticals and other fine chemicals, they also
highlight regions of FDH sequence space most likely to be of interest
for biocatalysis.
Active enzymes were found in most Level 2
subnetworks that comprise
the Level 1 Indole Subnetwork (Figure B). For example, all enzymes in both tryptophan halogenase
subnetworks were active, as was the only enzyme in subnetwork 21.
Most enzymes in the BrvH halogenase subnetwork (84%), three out of
four enzymes in subnetwork 2PYX, and two of three enzymes (1-C08 and
1-F11) in subnetwork 9 were active. The activity results within the
Indole Subnetwork broadly show that a high fraction of these enzymes
have potential as useful biocatalysts and highlight several underexplored
regions in the FDH sequence space that merit further investigation.
Analysis of Level 2 subnetworks within the Level 1 Phenol Subnetwork
also highlights regions with high potential for biocatalyst identification.
The majority of the tested enzymes in the variant A subnetwork, including
1-F05, were active (66%), but both of the variant B halogenases were
insoluble under the conditions examined. The only soluble genome-mined
enzyme in the large phenol halogenase subnetwork containing XanH was
active. The single evaluated enzyme in the NapH2 subnetwork was inactive,
as were the six soluble enzymes that were either singletons or within
small (<15 members) subnetworks. Overall, the variant A subnetwork
within the Phenol Subnetwork shows clear promise as a source of novel
biocatalysts, but further study of other subnetworks would be required
to get a clearer picture of their potential.
Finally, functional
characterization of enzymes across the FDH
SSN demonstrated that enzymes within a Level 1 subnetwork have similar
activity profiles on smaller probe substrates but that this trend
diminishes on more complex substrates. Not surprisingly, more closely
related enzymes possess more similar activity profiles. Highly similar
substrate activity profiles are observed for RebH and 1-B12 (64% identical),
both of which are within the Level 2 tryptophan 6,7-halogenase subnetwork,
as well as for halogenases 1-H11 and 1-F10 (50% identical), both of
which are in the Level 2 BrvH subnetwork. These trends suggest an
approximate %ID threshold for future genome-mining of new halogenases
with similar substrate scopes. Because the gene selection process
of this study intentionally favored diverse sequences to maximize
the breadth of our search for new halogenases, however, there are
few instances of such similar enzyme pairs in which both were soluble
and highly active. The average %ID for the most closely related enzyme
within the genome-mined set was 41.2 ± 12.4%, perhaps too low
for similarities in activity profiles among enzymes to result in consistent
trends. More thorough genome mining of subnetworks with highly active
FDHs could yield more concrete activity profiles and reveal more detailed
information regarding enzyme substrate preferences.
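The two-level network structure described in this section can be illustrated with a minimal sketch in which sequences are nodes, an edge joins two sequences whose pairwise similarity meets a cutoff, and subnetworks are the connected components of the resulting graph. The sequence names, similarity scores, and cutoffs below are hypothetical, and percent identity is used only as a stand-in for whatever edge score the SSN was actually built with.

```python
def subnetworks(nodes, pair_scores, cutoff):
    """Connected components of the graph whose edges have score >= cutoff."""
    neighbors = {n: set() for n in nodes}
    for (a, b), score in pair_scores.items():
        if score >= cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(sorted(component))
    return sorted(components)


# Hypothetical sequences and pairwise similarity scores (not data from the study).
nodes = ["trp5_a", "trp5_b", "trp7_a", "trp7_b", "phenol_a", "phenol_b"]
pair_scores = {
    ("trp5_a", "trp5_b"): 62,
    ("trp7_a", "trp7_b"): 64,
    ("trp5_a", "trp7_a"): 41,
    ("trp5_b", "trp7_b"): 39,
    ("phenol_a", "phenol_b"): 55,
    ("trp5_a", "phenol_a"): 22,
    ("trp7_b", "phenol_b"): 20,
}

level1 = subnetworks(nodes, pair_scores, cutoff=35)  # permissive cutoff
level2 = subnetworks(nodes, pair_scores, cutoff=50)  # more stringent cutoff
print("Level 1:", level1)
print("Level 2:", level2)
```

In this toy network the permissive cutoff separates indole-type from phenol-type sequences, while the stringent cutoff further splits the indole group by site selectivity, mirroring the Level 1 and Level 2 distinction drawn in the text.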
Unique Activity
and Selectivity of Mined Halogenases
The selectivity of FDHs
on their native substrates has driven interest
in these enzymes as biocatalysts. The potential of FDHs has been explored
by researchers seeking to extend their synthetic utility toward gram-scale
synthesis,[24] more facile cross-coupling
chemistry,[21−23] synthesis of enantioenriched products,[78,79] and diversification of natural product biosynthetic pathways.[80−82] Other work has sought to make operation more economical, including
efforts toward improved cofactor regeneration.[24,83] Because a given FDH may not provide the selectivity required for
a particular application, however, a number of laboratories have explored
the use of targeted mutations to alter FDH selectivity. While grafting
key residues from one tryptophan halogenase into another has been
used to switch selectivity on tryptophan,[29] modest selectivity has generally been reported for efforts focused
on non-native substrates (e.g., converting SttH from 90% 6-selective
to 75% 5-selective for 3-indolepropionic acid chlorination).[30,84] To address this issue, our lab established that directed evolution
can be used to generate FDHs with high (>90%) regioselectivity
for
different sites on a single substrate (tryptamine), and that the resulting
enzymes also had altered selectivity on a range of other substrates.[28] Several rounds of evolution were required to
achieve this goal, so accelerating the identification of FDHs with
complementary regioselectivity on non-native substrates remains an
important objective.
Gratifyingly, several enzymes identified
in our family-wide survey of FDH activity exhibited regiocomplementarity
on a number of structurally complex substrates. For example, the enzyme
2-C01 often provided different regiochemical outcomes as compared
to other halogenases. This FDH is present in a subnetwork along with
MibH, which natively chlorinates a tryptophan indole ring in a large
lanthipeptide.[54] MibH has a large, hydrophobic
binding pocket in order to accommodate its native substrate. The genome-mined
halogenase 2-C01 may have a similar active site, which could accommodate
large substrates in distinct binding poses. This finding suggests
that 2-C01 could be a promising starting point for evolving FDHs to
achieve late-stage functionalization of aryl C–H bonds with
distinct regioselectivity relative to other FDHs characterized to
date.
Comparison of the Genome-Mined Halogenase Library with Evolved
Variants
Enzymes frequently require substantial modification
before they are capable of being deployed as useful catalysts for
organic synthesis. As the regiocomplementarity noted above illustrates,
access to a diverse pool of enzymes that can serve as starting points
for directed evolution can greatly expedite biocatalyst identification.
While evolving a single enzyme can take a great deal of effort and
may ultimately fail to provide the desired levels of improvement,
a related enzyme may be better suited initially to the task and can
drastically reduce the effort required to obtain a desired biocatalyst.
This point can be retrospectively illustrated by several enzymes in
our genome-mined set, which perform comparably to evolved RebH variants
with increased thermal stability,[27] expanded
substrate scope,[20] and altered regioselectivity.[28]
Halogenase 1-F11 is notable in this regard.
FDH biocatalysis is often hampered by low protein expression yields;[56] therefore, a more soluble starting enzyme for
FDH directed evolution would be especially attractive. The expression
yield of 1-F11 was 125 ± 30 mg/L from a 50 mL expression culture,
higher than that of RebH, the expression of which yielded 54 ±
21 mg/L enzyme under analogous expression conditions (Figure 8A). Higher halogenase activity
in lysate on numerous substrates is also observed for 1-F11 compared
with RebH.[64] Despite originating from a
mesophilic Sphingomonas species within an Arabidopsis thaliana root microbiome, 1-F11 has thermal stability (Tm = 66.5 ± 0.2 °C) comparable to that of RebH variant
3-LSR (Tm = 69.5 ± 0.4 °C, Figure 8B), which was evolved
over three rounds of directed evolution for improved thermal stability.[27] Since more stable enzymes can have a longer
catalytic lifetime and can better tolerate random mutations,[86] 1-F11 provides a convenient starting point for
directed evolution. 1-F11 also compares favorably in substrate scope
with the RebH mutant 4V, which was evolved over four rounds of directed
evolution for the late-stage C–H functionalization of yohimbine,
a complex, biologically active molecule.[20] Because RebH has minimal activity on yohimbine, a substrate-walking
directed evolution approach was required to evolve an enzyme that
could halogenate this compound. Enzyme 1-F11, on the other hand, was
capable of brominating yohimbine without any modification through
directed evolution (Figure 8C). In short, 1-F11 possesses capabilities that took a total
of seven rounds of directed evolution using two different approaches
to accomplish, highlighting the benefits of family-wide genome mining
for biocatalyst identification. Moreover, given the broad substrate
scope of 1-F11, we envision it could be an ideal starting point for
further directed evolution.
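The melting temperatures compared above come from fits to CD thermal melts. The sketch below shows one way a Tm could be recovered from an unfolding curve, assuming a two-state model with a fixed van't Hoff enthalpy and no baseline slopes. This is a simplification for illustration and is not a description of the fitting model used in the study; the enthalpy value, the grid-search fit, and the synthetic data are hypothetical, and only the 66.5 °C midpoint is taken from the text.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def fraction_unfolded(temp_c, tm_c, dh_kcal):
    """Two-state van't Hoff model: K = exp(dH/R * (1/Tm - 1/T))."""
    t = temp_c + 273.15
    tm = tm_c + 273.15
    k = math.exp(dh_kcal / R_KCAL * (1.0 / tm - 1.0 / t))
    return k / (1.0 + k)


def fit_tm(temps_c, fractions, dh_kcal, lo=40.0, hi=90.0, step=0.1):
    """Grid search for the Tm that minimizes squared error, dH held fixed."""
    best_tm, best_err = lo, float("inf")
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        tm = lo + i * step
        err = sum(
            (fraction_unfolded(t, tm, dh_kcal) - f) ** 2
            for t, f in zip(temps_c, fractions)
        )
        if err < best_err:
            best_tm, best_err = tm, err
    return best_tm


REPORTED_TM_C = 66.5  # Tm reported for 1-F11 in the text
ASSUMED_DH = 100.0    # hypothetical unfolding enthalpy, kcal/mol

# Synthetic, noise-free curve generated from the model itself (not measured data).
temps = list(range(40, 91, 5))
fractions = [fraction_unfolded(t, REPORTED_TM_C, ASSUMED_DH) for t in temps]

fitted_tm = fit_tm(temps, fractions, ASSUMED_DH)
print(f"fitted Tm = {fitted_tm:.1f} C")
```

A real fit would also float the enthalpy and the folded and unfolded baselines; the grid search is used here only to keep the example free of external dependencies.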
Figure 8
(A) Comparison of isolated RebH and 1-F11 protein
yields after
Ni-NTA purification from 50 mL expression cultures. (B) Comparison
of CD thermal melts of RebH, thermostable RebH variant 3-LSR, and
genome-mined halogenase 1-F11. Curves shown are the best fit for thermal
unfolding monitored at 222 nm using CDPal.[85] (C) Wild-type RebH required several rounds of directed evolution
before yohimbine halogenation was detectable. Halogenase 1-F11 can
halogenate yohimbine without directed evolution (HPLC conversion shown).
Safety Statement
No uncommon safety
risks were encountered
while conducting the described research.
Conclusions
FDHs
were first characterized in the mid-1990s. Since this time,
most of the FDHs reported have come from either studies on individual
biosynthetic pathways or genome mining efforts targeting specific
organisms or metagenomic samples.[19] Figure C and Figure E,F illustrate how these efforts
have focused on a remarkably narrow range of FDH sequence space and
missed out on large swaths of this space that contain functional enzymes,
respectively. Family-wide activity analysis shows that similar fractions
of FDHs from bacteria and fungi are soluble and active and that bromination
is more commonly observed than chlorination. Broader sampling of this
space has not only led to the identification of new enzymes with unique
catalytic properties but also highlighted regions of sequence space
that are ripe for further exploration. As noted above, other regions
might also be suitable for different types of substrates than those
examined herein, but for the electron-rich aromatic compounds explored
to date, these regions are clearly privileged. Moreover, the SSNs
and substrate activity profiles developed in this study offer a predictive
ability for focusing biocatalyst selection or further genome mining
efforts for particular applications. Extending this approach, involving
SSN-guided selection of sequences from throughout an enzyme family,
label-free high-throughput mass spectrometry screening using synthetic
probe substrates, and activity profiling, to other enzymes has great
potential for expediting biocatalyst identification.
Beyond
these family-wide findings, a number of remarkably useful
enzymes were identified in the representative set that was explored.
Particularly notable in this regard are 1-F11, 1-F08, 2-C01, 1-F05,
and 1-B12. Collectively, these enzymes enable C–H halogenation
of previously inaccessible substrates, provide complementary site
selectivity on complex biologically active substrates, and exhibit
improved thermostability relative to a commonly reported FDH. Their
sequences also differ significantly from other FDHs that have been
explored in vitro. With the exception of 1-B12, which
is 64% identical to RebH, they are only 34–43% identical to
FDHs that have been explored as biocatalysts. These novel and diverse
halogenases therefore represent promising starting points for both
directed evolution and additional genome mining aimed at identifying
similarly effective biocatalysts. The activity of these enzymes on
complex natural products and pharmaceuticals also suggests that their
native substrates could be similarly fascinating structures. It could
therefore be interesting to examine the native function of these enzymes,
reversing the enzymology-to-biocatalysis progression that has dominated
biocatalyst development to date.[4] This
approach could provide a unique means of identifying new halogenated
natural products and other unique compounds when extended to other
enzyme classes.