
Site-Selective C-H Halogenation Using Flavin-Dependent Halogenases Identified via Family-Wide Activity Profiling.

Brian F Fisher¹, Harrison M Snodgrass¹, Krysten A Jones², Mary C Andorfer³, Jared C Lewis¹.

Abstract

Enzymes are powerful catalysts for site-selective C-H bond functionalization. Identifying suitable enzymes for this task and for biocatalysis in general remains challenging, however, due to the fundamental difficulty of predicting catalytic activity from sequence information. In this study, family-wide activity profiling was used to obtain sequence-function information on flavin-dependent halogenases (FDHs). This broad survey provided a number of insights into FDH activity, including halide specificity and substrate preference, that were not apparent from the more focused studies reported to date. Regions of FDH sequence space that are most likely to contain enzymes suitable for halogenating small-molecule substrates were also identified. FDHs with novel substrate scope and complementary regioselectivity on large, three-dimensionally complex compounds were characterized and used for preparative-scale late-stage C-H functionalization. In many cases, these enzymes provide activities that required several rounds of directed evolution to accomplish in previous efforts, highlighting that this approach can achieve significant time savings for biocatalyst identification and provide advanced starting points for further evolution.


Year: 2019 PMID: 31807686 PMCID: PMC6891866 DOI: 10.1021/acscentsci.9b00835

Source DB: PubMed Journal: ACS Cent Sci ISSN: 2374-7943 Impact factor: 14.553

Introduction

Enzymes can be powerful tools for the synthesis of fine chemicals, pharmaceuticals,[1] agrochemicals, and many other materials.[2] On the other hand, the very features that give rise to the selectivity and catalytic proficiency of enzymes acting on their native substrates often lead to high substrate specificity and thus poor activity on non-native substrates. The dearth of enzymes available for reactions of interest can therefore be a major impediment to implementing enzymes in synthetic routes. Many of the enzymes commonly used today (e.g., ketoreductases, transaminases, cytochromes P450, etc.) were originally identified via arduous enzymology aimed at clarifying their native biological activities. Often, many rounds of directed evolution were also required to optimize these enzymes for synthetic applications. Expanding the number of known biocatalysts, with an emphasis on exploring broad sequence diversity within an enzyme family, could therefore greatly facilitate the use of enzymes in chemical synthesis.[3,4] Numerous methods have been used to explore the functional diversity of naturally occurring enzymes in discrete genomes, metagenomic samples, and sequence databases.[5] Advances in DNA sequencing—in particular, metagenome sequencing—have resulted in an explosion of the size of protein sequence databases.[6] Coupled with the decreasing cost of gene synthesis,[7] mining these sequence databases for potential biocatalysts is becoming increasingly accessible to scientists.[8] Such approaches are most commonly used to identify enzymes that act on a substrate of interest, often a chromogenic probe compound chosen more for ease of screening than for synthetic utility,[9,10] but efforts to profile the activity of entire enzyme families on a range of substrates and identify biocatalysts with a collectively broad substrate scope are less common. 
Family-wide profiling efforts include investigations on phosphatases,[11] metallo-β-lactamases,[9,12] and glutathione-S-transferases.[13] Studies on dehalogenases,[14] esterases,[15] glycosyl transferases,[16] and imine reductases[17] highlight the potential synthetic utility of enzymes identified from such efforts. Comparable genome mining efforts on enzymes that functionalize C–H bonds have not been reported. Flavin-dependent halogenases (FDHs), which catalyze site-selective C–H halogenation of electron-rich aromatic compounds, have been studied extensively due to their potential synthetic utility.[18,19] Late-stage functionalization,[20] sequential halogenation/cross-coupling,[21−23] and preparative-scale halogenation[24] have all been accomplished using these enzymes. Our efforts have focused primarily on RebH, an FDH that was identified in studies aimed at elucidating the biosynthetic pathway of the antitumor compound rebeccamycin.[25] In this context, RebH catalyzes site-selective chlorination of tryptophan, and it has since been shown to halogenate a range of indoles and anilines.[26] Our group has also shown that directed evolution can be used to create RebH variants with improved thermal stability,[27] high activity on large, biologically active compounds,[20] and high selectivity for different sites on target compounds.[28] While effective, these efforts required 3–10 rounds of directed evolution due to the wild-type enzyme’s modest stability, low activity on large substrates, and high regioselectivity. While additional FDHs could therefore expand the utility of these enzymes for synthesis, only a relatively narrow set of FDHs have been investigated for biocatalysis.[19] FDHs that catalyze tryptophan chlorination, such as RebH, Thal,[29] SttH,[30] and PrnA,[31] in particular are over-represented. 
Fungal halogenases, such as Rdc2,[32] RadH,[33] and GsfI,[34] which natively chlorinate phenol-containing substrates, have also been shown to be active and selective biocatalysts. Literature reports on the collective substrate scopes of the FDHs reported to date suggest that they prefer chloride over other halides and that they act on electron-rich compounds similar to their native substrate.[19] On the other hand, the existence of a range of complex halogenated natural products distinct from those produced by well-characterized biosynthetic gene clusters[35] implies that FDHs with unique substrate scopes might be found in less characterized halogenase subgroups.[36] We hypothesized that exploring uncharacterized FDHs found in protein sequence databases could, together with currently characterized enzymes, form a diverse starting toolkit for selective late-stage C–H halogenation. Herein, we describe the use of a high-throughput mass-spectrometry-based screen to evaluate a broad set of over 100 putative FDH sequences drawn from throughout the FDH family. Halogenases with novel substrate scope and complementary regioselectivity on large, three-dimensionally complex compounds were identified. This effort involved far more extensive sequence–function analysis than has been accomplished using the relatively narrow range of FDHs characterized to date, providing a clearer picture of the regions in FDH sequence space that are most likely to contain enzymes suitable for halogenating small-molecule substrates. The representative enzyme panel constructed in this study also provides a rapid means to identify FDHs for lead diversification via late-stage C–H functionalization. In many cases, these enzymes provide activities that required several rounds of directed evolution to accomplish in previous efforts, highlighting that this approach can achieve significant time savings for biocatalyst identification and provide advanced starting points for further evolution.

Results

Organization of Halogenase Sequence-Similarity Network

A BLAST search of the UniProt sequence database using RebH as a query sequence and an E-value threshold of 10⁻⁵ generated 3975 unique hits spanning a range of sequence and host diversity, including bacterial, archaeal, eukaryotic, and viral proteins. Nearly all (>90%) previously reported FDHs are present in this set. The dinucleotide-binding GxGxxG motif, characteristic of FAD-binding proteins, is present in 92% of the sequences, and the WxWxIP motif,[37] characteristic of FDHs but absent in flavin-dependent monooxygenases, is found in 69% of the sequences. The latter value increases to 78% when motif variants WxWxI[R,G][38] are included. Collectively, these analyses suggest that the majority of the sequences examined are likely FDHs. Sequence-similarity networks (SSNs)[39] were then used to visualize functional relationships among putative FDH sequences. In this representation, protein sequences are illustrated as nodes in a network graph that are connected by edges (lines) to other sequences that exceed a specified pairwise sequence similarity. An SSN was generated for the entire FDH sequence set with a permissive edge detection threshold (corresponding to ≈30% sequence identity) using the Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST).[40,41] Previously reported data for 129 known enzymes found among the BLAST hits were mapped onto this Level 1 SSN to explore subnetwork colocalization of enzyme properties. The clearest defining features of the individual subnetworks are host domain and compound class (indole, phenol, or pyrrole) of native substrates for known FDHs within the subnetworks (Figure 1A), the latter suggesting that the SSN might provide a framework for identifying enzymes that act on specific compound classes and for surveying regions of sequence space where substrate preference is unknown.
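The motif tallies described above amount to a pattern scan over each sequence. The following is a minimal sketch of such a scan, not the authors' pipeline: it assumes that "x" in the motif notation means any single residue, and the toy sequences and the function name motif_fractions are hypothetical, chosen only for illustration.

```python
import re

# Motifs from the text, written as regular expressions. Reading "x" as
# "any single residue" is an assumption about the notation.
MOTIFS = {
    "GxGxxG": re.compile(r"G.G..G"),            # dinucleotide (FAD) binding
    "WxWxIP": re.compile(r"W.W.IP"),            # canonical FDH motif
    "WxWxI[P,R,G]": re.compile(r"W.W.I[PRG]"),  # canonical motif plus R/G variants
}

def motif_fractions(sequences):
    """Return, for each motif, the fraction of sequences that contain it."""
    total = len(sequences)
    return {
        name: sum(1 for seq in sequences if pattern.search(seq)) / total
        for name, pattern in MOTIFS.items()
    }

# Hypothetical toy sequences (not real halogenases), for illustration only.
toy = [
    "MAGSGTAGWAWEIPLL",  # FAD motif and canonical FDH motif
    "MAGAGLLGWTWKIRAA",  # FAD motif and variant FDH motif
    "MAGCGKTGAAAAAAAA",  # FAD motif only
    "MKKLLVVAAAAAAAAA",  # neither motif
]
fractions = motif_fractions(toy)
print(fractions)
```

Including the R and G variants can only raise the tally relative to the canonical motif, which mirrors the increase from 69% to 78% reported in the text.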

Figure 1

(A) Sequence-similarity network for flavin-dependent halogenases. Each circle is a representative node, grouping protein sequences with >50% sequence identity as determined by CD-HIT.[65] Edge detection threshold set at alignment score of 70 (≈30% sequence identity). Nodes are filled according to native substrate functional group of at least one sequence in the representative node; colored stroke indicates domain (thin black stroke = bacterial). Subnetworks with ≥15 sequences but without any characterized protein are labeled numerically. Level 2 subnetworks formed from the Indole (B) and Phenol (C) subnetwork using a stricter alignment score cutoff of 140 (≈40% sequence identity). Level 2 subnetworks are labeled based on known sequences in the subnetwork. For Indole Subnetwork sequences, nodes containing known tryptophan halogenases are filled according to their regioselectivity, and subnetworks with ≥15 sequences are labeled numerically. For Phenol Subnetwork sequences, nodes are filled according to the halogenase variant type (A = free small molecule native substrate, B = ACP-tethered native substrate).

The largest subnetwork, comprising 2270 sequences, contains FDHs that either natively halogenate tryptophan or have been shown to catalyze indole halogenation in vitro. All known tryptophan FDHs are found in this Indole Subnetwork, including tryptophan 5-, 6-, and 7-halogenases PyrH,[42] SttH,[43] and RebH.[44] BrvH, a halogenase identified from metagenomic analysis,[45] and three recently reported halogenases from Xanthomonas campestris[46] are also in this subnetwork. Although the native substrates of these enzymes are not known, they have been shown to halogenate a variety of small indoles. A protein whose structure has been determined as part of structural genomics efforts (PDB: 2PYX)[47] is also present in this subnetwork, although its native activity is also unknown. 
The second largest subnetwork comprises 438 sequences from bacteria and fungi and includes most known phenol FDHs that, collectively, halogenate a diverse range of phenol-containing substrates. For example, the bacterial halogenase TiaM chlorinates a large macrocyclic intermediate in the biosynthesis of tiacumicin B.[48] Bacterial halogenases VhaA and Tcp21 chlorinate PCP-tethered amino acids in the biosynthesis of the NRPS glycopeptide antibiotics vancomycin and teicoplanin, respectively.[49] The putative bacterial iodinase CalO3 is also present in the Phenol Subnetwork,[50] showcasing that substrate diversity also extends to halide specificity. All fungal phenol FDHs that have been studied as biocatalysts on diverse substrates, including Rdc2,[32] RadH,[33] and GsfI,[34] are also contained in this subnetwork. The third largest subnetwork, with 212 sequences, contains FDHs that are involved in chlorinated pyrrole natural product biosynthesis. The six pyrrole halogenases in the Pyrrole Subnetwork have an average pairwise identity of 87%, and all are annotated in UniProt as PrnC, which halogenates a pyrrole small-molecule intermediate in pyrrolnitrin biosynthesis.[51] The halogenase PltM natively halogenates a phenolic substrate, phloroglucinol, to produce chlorinated compounds that induce biosynthesis of a pyrrole-containing natural product, pyoluteorin.[52] Two proteins, Dox16 and Dox17, potentially halogenate phenolic moieties during the biosynthesis of pyrrolomycins,[53] pyrrole-containing compounds structurally similar to pyrrolnitrin. These observations suggest that the common ancestor to enzymes in this subnetwork diverged in substrate specificity to yield halogenases specialized for distinct roles in chlorinated pyrrole natural product biosynthesis (Figure S18). 
Most of the smaller subnetworks contain only uncharacterized proteins (and are therefore simply assigned a number for reference in Figures 1 or 2), but several include known FDHs with diverse native substrates. One small subnetwork contains several enzymes, including MibH,[54] MscL,[55] and KrmI,[56] that natively chlorinate peptidyl tryptophan side chains in macrocyclic lanthipeptide and NRPS natural products. Other subnetworks include MalA and MalA′, which are responsible for iterative chlorination in the biosynthesis of malbrancheamide,[38] ChlA, which chlorinates a phenol in DIF-1 biosynthesis,[57] and GetL, an enzyme suspected to be responsible for chlorinating PCP-tethered histidine in the biosynthesis of tetrapeptide antibiotics.[58] Halogenases responsible for chlorinating ACP-tethered pyrroles (variant B pyrrole halogenases) such as PltA[59,60] and Mpy16[61] occupied a subnetwork distinct from the larger subnetwork that included the variant A pyrrole halogenase PrnC. A few subnetworks contain enzymes that are not FDHs, including the flavin-dependent monooxygenases Qhpg[62] and LodB[63] and putative geranylgeranyl reductases.[64]

Figure 2

(A) Sequence-similarity network for flavin-dependent halogenases, drawn at the less stringent edge detection threshold (≈30% identity), colored according to subnetwork. Subnetworks within the Indole and Phenol Subnetworks at the more stringent threshold are colored differently. Subnetworks with fewer than 15 members are colored white; subnetworks without sequences of known or inferred function are colored light gray. (B) Treemap illustrating the SSN with the same coloring as part A. (C) Treemap comparing FDHs previously studied as biocatalysts with FDHs investigated in this study. (D) Treemap illustrating solubility of genome-mined enzymes in each subnetwork of the SSN. Color gradient represents the fraction of enzymes within the subnetwork that was soluble; diagonal bars indicate subnetworks wherein no enzyme was tested. Treemaps illustrating the fraction of enzymes in each subnetwork that were capable of chlorinating (E) or brominating (F) at least one substrate in the high-throughput screen (8% conversion threshold).

Subnetworks in SSNs can be explored in greater detail by increasing the stringency of the sequence similarity required for edge detection.[39] The SSN drawn with ≈30% identity cutoff for edge detection (Level 1) was examined with the identity cutoff increased to ≈40% (Level 2). Functionally distinct subnetworks within the Level 1 Indole Subnetwork became evident in the Level 2 SSN. All known tryptophan halogenases localized to only two relatively small Level 2 subnetworks distinguished by their regioselectivity (Figure 1B). All tryptophan 5-halogenases, such as PyrH, localized into one of these, and all tryptophan 7-halogenases, including RebH and PrnA, were found in the other subnetwork. Interestingly, tryptophan 6-halogenases were found roughly evenly distributed between these two subnetworks. 
Only two reports describe the substrate scopes of FDHs within the largest Level 2 subnetwork in the Indole Subnetwork, which demonstrated that some enzymes in this subnetwork prefer bromination to chlorination.[45,46] The second largest subnetwork contained the sequence of the structurally characterized but functionally uncharacterized protein 2PYX. Overall, the sparse evaluation of enzymes within the Indole Subnetwork highlights the fact that, even among proteins that are most similar to the well-characterized tryptophan halogenases, there remains a vast amount of sequence space to be explored. Closer inspection of the Level 1 Phenol Subnetwork at the stricter Level 2 identity cutoff shows subnetworks separated on the basis of domain and whether the FDH natively halogenates a free small molecule (variant A) or an acyl carrier protein-tethered small molecule (variant B) (Figure 1C). The largest subnetwork is composed entirely of eukaryotic sequences, and all experimentally characterized proteins within the subnetwork, such as Rdc2, are variant A halogenases. The second-largest subnetwork in this group contains only bacterial sequences, many of which, including VhaA,[66] are variant B halogenases that catalyze chlorination in glycopeptide antibiotic biosynthesis.[49]
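The way a Level 1 subnetwork splits into Level 2 subnetworks when the edge threshold is raised can be illustrated with a small connected-components calculation. The sketch below is only an illustration under stated assumptions: the sequence names and pairwise alignment scores are hypothetical, the function subnetworks is not part of EFI-EST, and only the two cutoffs (alignment scores of 70 and 140) are taken from the Figure 1 caption.

```python
from itertools import combinations

def subnetworks(scores, nodes, threshold):
    """Group nodes into connected components, drawing an edge between two
    nodes when their pairwise alignment score meets the threshold."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the component root (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(nodes, 2):
        score = scores.get((a, b), scores.get((b, a), 0))
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    # Largest components first; ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))

# Hypothetical sequences and alignment scores, for illustration only.
nodes = ["seq1", "seq2", "seq3", "seq4", "seq5"]
scores = {
    ("seq1", "seq2"): 180,
    ("seq1", "seq3"): 90,
    ("seq2", "seq3"): 85,
    ("seq4", "seq5"): 160,
    ("seq3", "seq4"): 40,
}
level1 = subnetworks(scores, nodes, threshold=70)   # permissive cutoff
level2 = subnetworks(scores, nodes, threshold=140)  # stricter cutoff
print(level1)
print(level2)
```

In this toy network, raising the cutoff detaches seq3 from the first component while the second component stays intact, the same kind of splitting that separates the Level 1 Indole Subnetwork into Level 2 subnetworks.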

Expression of Genome-Mined Halogenases

The sequence-similarity network outlined above was used as a framework to guide the selection of a diverse set of novel FDHs from each subnetwork. The Phenol Subnetwork was oversampled due to the high structural diversity of substrates natively halogenated by known enzymes in this subnetwork. Other sequences were sampled evenly from the rest of the SSN. Transcriptomic data for sequences from fungi were analyzed using the JGI Mycocosm database to prioritize the synthesis of sequences in order of sequence model quality (see the SI).[87] Figure 2 depicts the SSN and treemap representations summarizing the distribution of different properties of enzymes in the different subnetworks. A total of 128 putative halogenase sequences and RebH as a control were codon-optimized and coexpressed in pET28 with chaperones from the plasmid pGro7 in Escherichia coli BL21(DE3) under conditions found to be successful in expression of bacterial as well as fungal halogenases.[34] A total of 87 new enzymes were obtained in sufficient soluble concentration for functional characterization, but attempts to improve parallel expression of the remaining enzymes did not lead to significant improvements (Figure S55). Halogenases from throughout the entire SSN could be expressed with good titers, but solubility was not evenly distributed (Figure 2D). While 68% of enzymes were soluble, the Indole Subnetwork provided a significantly higher fraction of soluble enzymes compared to others (91%, 42 total). The halogenases in the Phenol Subnetwork had much lower solubility (49% overall, 17 total), which was not significantly influenced by the domain of the source organism (45% soluble for eukaryotic, and 50% soluble for bacterial genes). The Pyrrole Subnetwork showed near-average solubility (71%, 5 total), while several small subnetworks that were sampled provided no soluble halogenases under the expression conditions tested.

Probe Substrate High-Throughput Screen

The set of 87 diverse, soluble FDHs was subjected to a high-throughput activity screen to evaluate which enzymes had detectable activity and, for active enzymes, to develop substrate activity profiles to better understand whether activity and subnetwork membership were related. For initial activity screens, a set of 12 probe substrates (4 indoles, 4 anilines, and 4 phenols) was selected from among the substrates previously found by our group to be reactive under FDH chlorination conditions (Figure 3A). The key hypothesis governing selection of these substrates was that their high inherent reactivity, reflected in their high calculated halenium affinity values,[34,67] would lead to detectable reactivity with active enzymes even if they exhibited poor binding within FDH active sites. Structural variation within the panel was used to facilitate the identification of viable substrates,[68] and substrates with multiple potentially reactive sites were prioritized to increase the probability that reactive binding poses could be achieved. Initial screens evaluated both chlorination and bromination activities, the two most common halogenation reactions catalyzed by FDHs. The probe substrate screens required at least 2088 independent experiments (87 enzymes × 12 substrates × 2 halides), not including replicates or controls. This heavy screening requirement prompted us to adopt a high-throughput LC-MS-based screen (Figure 3B), which also required that viable substrates ionize well by ESI.[69−71] Using this method, analysis throughput of up to ≈11 s per reaction was achieved, and ultimately ≈20 000 experiments were analyzed.
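The experiment count quoted above follows from the size of the screen, and the stated throughput gives a rough sense of the instrument time involved. The short calculation below is a sketch of that arithmetic; the instrument-time figure is our own back-of-the-envelope estimate, assuming every experiment is analyzed at the fastest quoted rate, and is not a number reported in the study.

```python
# Screen dimensions taken from the text.
soluble_enzymes = 87
probe_substrates = 12
halides = 2  # chloride and bromide

# Minimum number of independent reactions, before replicates or controls.
min_experiments = soluble_enzymes * probe_substrates * halides

# Rough lower bound on analysis time (an estimate, not a reported value),
# assuming all ~20 000 experiments ran at the fastest rate of ~11 s each.
seconds_per_reaction = 11
total_analyzed = 20_000
instrument_hours = total_analyzed * seconds_per_reaction / 3600

print(min_experiments)             # 2088
print(round(instrument_hours, 1))  # 61.1
```

Under these assumptions, the full data set corresponds to a little over 60 hours of continuous analysis, which is why per-sample analysis time is the key parameter for a screen of this size.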

Figure 3

(A) Probe substrates included in initial high-throughput screen. (B) Scheme summarizing LC-MS-based high-throughput screening method employed.

A total of 39 new halogenases (45% of soluble enzymes) were able to halogenate at least one of the probe substrates. Halogenation of nearly the entire probe substrate panel was achieved by the genome-mined set of enzymes; only formoterol was not halogenated by at least one new halogenase. Overall, bromination activity was more prevalent than chlorination activity. All genome-mined enzymes that were active had brominase activity, but only 16% of the enzyme set had detectable chlorinase activity. Activity was unevenly distributed across the SSN; certain subnetworks had a higher abundance of active enzymes than others (Figure 2E–F). The Indole Subnetwork had a particularly high percentage of active enzymes; of the 42 Indole Subnetwork enzymes screened, 27 (64%) were active. The fraction of active enzymes was similar for bacterial and eukaryotic proteins, with 48% of bacterial and 56% of eukaryotic enzymes screened having some activity on probe substrates. One of the three viral proteins tested was active, and none of the six archaeal proteins were active. The high-throughput screening conversion data for each reaction were plotted as a heatmap, and hierarchical clustering analysis was used to characterize, separately, the similarity of activity profiles for substrates and for FDHs (Figure 4). Substrates tended to form clusters based on their compound class, consistent with the observed similarity of “enzyme-scope” of substrates within the same substrate class.[34] All phenols were present in two substrate clusters, one containing only chlorination reactions and the other containing only bromination reactions. Anilines and indoles were more mixed into the remaining two clusters, but still distinguishable. 
One of these clusters primarily included indole chlorination, dominated by the high indole chlorination activity of RebH and a highly similar enzyme, 1-B12. The other contained mostly aniline bromination reactions, the high activity for which was more broadly distributed.
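The per-subnetwork activity fractions discussed above come from applying a conversion threshold to each reaction and counting an enzyme as active if any of its reactions passes. The sketch below illustrates that bookkeeping under stated assumptions: the enzyme names, group labels, and conversion values are hypothetical, and only the 8% threshold is taken from the text.

```python
THRESHOLD = 8.0  # percent conversion; a hit must exceed this value

def is_active(conversions, threshold=THRESHOLD):
    """An enzyme counts as active if any reaction exceeds the threshold."""
    return any(c > threshold for c in conversions.values())

def active_fraction_by_group(screen, groups):
    """Return the fraction of active enzymes within each group."""
    counts = {}
    for enzyme, conversions in screen.items():
        active, total = counts.get(groups[enzyme], (0, 0))
        counts[groups[enzyme]] = (active + is_active(conversions), total + 1)
    return {group: active / total for group, (active, total) in counts.items()}

# Hypothetical screen: enzyme -> {(substrate, halide): percent conversion}.
screen = {
    "enzA": {("s1", "Cl"): 2.0, ("s1", "Br"): 35.0},
    "enzB": {("s1", "Cl"): 0.0, ("s1", "Br"): 7.9},
    "enzC": {("s2", "Br"): 12.5},
    "enzD": {("s2", "Br"): 8.0},  # exactly at threshold, so not a hit
}
groups = {"enzA": "indole", "enzB": "indole", "enzC": "phenol", "enzD": "phenol"}

fractions = active_fraction_by_group(screen, groups)
print(fractions)
```

The same calculation applied to the reported Indole Subnetwork counts (27 active of 42 screened) gives 27/42 ≈ 64%.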

Figure 4

Heatmap of high-throughput screening results, with hierarchical clustering dendrograms for substrate/halide activity similarity and enzyme activity similarity at the top and left, respectively. Substrate functional groups and halide used in the reaction are color-coded with bars at the tips of the dendrograms. Only reactions with >8% conversion are shown, a threshold selected to remove false positives (see the SI).

Most importantly, enzymes in the same subnetwork tended to cluster together based on their activity profiles. Four activity clusters of enzymes (AC1–4) can be distinguished from the probe substrate high-throughput screening data. AC1, at the top of the heatmap, contained almost exclusively halogenases in the Indole Subnetwork. None of the Indole Subnetwork enzymes in this AC were in either of the two Level 2 tryptophan subnetworks, however, and they were distinguished by their preference for bromination of phenols and anilines. Despite the fact that indoles are the most common substrates known to be halogenated by enzymes in the Indole Subnetwork, halogenase activity on indoles in AC1 was limited. Pindolol was the only indole halogenated by more than one enzyme, and only a single FDH, 1-F08 (34% identical to SttH), chlorinated more than one indole. Activity cluster 2 (AC2) had similar bromination scope to AC1 but had higher breadth of phenol chlorination activity. Only two of the nine enzymes in this activity cluster were present in the Indole Subnetwork, whereas four were in the Phenol Subnetwork. Halogenase 1-F11, from an unannotated subnetwork within the Indole Subnetwork (38% identical to tryptophan-5 halogenase ClaH[72]), and 2-C01, a halogenase in the same subnetwork as the lanthipeptide indole halogenase MibH (36% identical[54]), were capable of chlorinating multiple phenols, 2,4-dihydroxyacetophenone (2,4-DHAP) and 7-hydroxycoumarin. The FDH 2-C01 was particularly versatile in halide scope. 
UC-066, 7-hydroxycoumarin, and 2,4-DHAP were chlorinated and brominated by 2-C01 with similar yields. Enzyme 1-F05, from the Phenol Subnetwork (49% identical to ArmH4[73]), was similarly versatile in the halides it accepted, but its activity was specific for phenolic probe substrates. It had the broadest phenol substrate scope of any enzyme tested, but did not halogenate any aniline or indole. Activity cluster 3 (AC3) was small and populated by low-activity enzymes only having bromination activity on the substrates that were most easily halogenated. AC4 contained only two enzymes, RebH and 1-B12, that had the broadest substrate scope, particularly on indole probe substrates. The high probe substrate scope of RebH was expected by design, since the indoles and anilines of the probe panel were assembled from substrates that were known to be chlorinated by RebH. The enzyme 1-B12 has high sequence similarity to RebH (64% identical) and a strongly similar substrate activity profile.

Activity and Selectivity of Mined Halogenases toward Complex Substrates

Based on the remarkable activity that our genome-mined halogenases exhibited toward probe substrates, we next wondered whether they might be capable of halogenating substrates that were not selected from a set of easily halogenated compounds. A total of 50 larger and more three-dimensionally complex additional substrates were selected for these activity studies (Figure 5A). Among the compounds in this expanded substrate set were yohimbine, a compound for which we previously evolved halogenase activity from RebH,[20] and premalbrancheamide, a compound natively halogenated by the FDH MalA.[38] Most of the substrates have not been reported as FDH substrates previously, including β-estradiol 17-(β-d-glucuronide), an estrogen metabolite, and cabergoline, an ergot alkaloid. A total of 48% of the more complex substrates tested were halogenated by at least one halogenase under the nonoptimized conditions used in the high-throughput screen (Figure 5B). Hierarchical clustering was performed on the reaction data as above. However, the similarities between enzymes were substantially lower than in the clustering analysis of the probe substrate data, and activity clusters were consequently less defined.

Figure 5

(A) Representative compounds included in expanded high-throughput substrate screen, each of which was halogenated by at least one genome-mined FDH. (B) Heatmap of expanded substrate screen data with 10 of the most active enzymes from the probe high-throughput substrate screen.

Larger quantities of several of the most active genome-mined FDHs were expressed, purified, and used for preparative-scale bioconversions on a subset of the larger substrates evaluated (Figure 6). Premalbrancheamide is a compound natively dichlorinated by MalA at C5 and C6 and has been shown to be halogenated at the C5 or C6 positions nonselectively using either chloride or bromide as halide sources.[38] The Indole Subnetwork FDH 1-F08 preferentially brominates premalbrancheamide at C5 in 51% isolated yield, and it also brominates AZ20, a selective ATR kinase inhibitor, in 28% isolated yield at the indole C3 position. A different Indole Subnetwork enzyme, 1-F11, was capable of brominating β-estradiol 17-(β-d-glucuronide), an estradiol metabolite, at the 4-position of the steroid in 57% yield and the 1-position of the carvedilol carbazole ring in 56% isolated yield.

Figure 6

Preparative-scale bioconversions of larger substrates.

Many examples of reactions in which different regioisomers were formed by different enzymes were also identified (Figure 7). Pindolol, which is brominated at C7 by RebH variants,[21] is also brominated at C7 by the Indole Subnetwork FDH 1-F11. The MibH subnetwork enzyme 2-C01, on the other hand, preferentially brominates at C2. This finding is notable since C2 is less electronically activated than C7 based on its 3 kcal/mol lower halenium affinity (HalA), a metric for computationally evaluating the reactivity of different positions of a molecule toward EAS.[67] Naringenin is brominated at two different positions using 1-F11 or Phenol Subnetwork enzyme 1-F05. Despite the negligible energetic differences in HalA for C6 and C8 (0.7 kcal/mol), 1-F05 was found to be >95% selective for C8, while 1-F11 and other FDHs were found to have only minor preferences in regioselectivity for C8 or C6. Trp-6,7 halogenase subnetwork enzyme 1-B12 halogenates the indole-containing compound methylergonovine at C7, which has a halenium affinity 4.2 kcal/mol lower than C2, the most nucleophilic aromatic C–H site on this compound. The FDH 2-C01, on the other hand, brominates methylergonovine at C2.

Figure 7

Regiocomplementary halogenation of large molecules.

Discussion

Family-Wide View of FDH Properties

Family-wide analysis of FDHs revealed several notable trends that are not apparent from prior studies. First, FDHs from diverse host organisms can be solubly expressed without significant optimization of expression conditions. Bacterial enzymes had the highest soluble expression success rate (76%), while a lower fraction (40%) of eukaryotic enzymes were soluble. Notably, however, the lower fraction of soluble eukaryotic FDHs reflects the poorer solubility of halogenases in the Phenol Subnetwork regardless of host organism domain. Nearly all (20/23) of the eukaryotic proteins evaluated were within the Phenol Subnetwork. Within this subnetwork, the soluble expression rate is generally low, but it is actually higher for proteins from eukaryotes (54%) relative to bacteria (44%). This finding indicates that eukaryotic FDHs can be readily expressed in E. coli and that genome mining efforts should be encouraged to include enzymes from eukaryotic species.[74] Second, halogenase activity was also evenly distributed between enzymes from bacterial and eukaryotic organisms (48% and 56% active, respectively). FDH activity was not observed for any archaeal proteins evaluated, consistent with the strong possibility that most if not all archaeal sequences in the SSN are geranylgeranyl reductases. Interestingly, one viral FDH, a cyanophage auxiliary metabolic gene product,[75] was active, though its activity and substrate scope were low (conversion of <35% on only three probe substrates was observed). In general, the identification of such a high percentage of active halogenases, despite the use of non-native substrates for activity profiling and a lax homology requirement for evaluation (E-value threshold of 10⁻⁵), suggests that this family contains a large number of enzymes suitable for biocatalysis. Third, bromination activity was much more widespread than chlorination activity within the FDHs surveyed. 
The majority of the FDH biocatalysis literature focuses on chlorination activity because most FDHs reported to date are involved in the biosynthesis of chlorinated natural products. Moore[76] has reported three flavin-dependent brominases involved in the biosynthesis of brominated natural products, but these are more distantly related to enzymes comprising the SSN in the current study. These brominases have 17 ± 4% sequence identity to enzymes in the SSN; for comparison, RebH exhibits 29 ± 10% sequence identity to our genome-mined enzymes. Sewald[45,46] reported flavin-dependent halogenases (contained in the Indole Subnetwork of the FDH SSN) that prefer bromide over chloride when acting on the (presumably) non-native substrate indole. While this observation was taken to indicate specificity of these enzymes toward bromide, our findings indicate that a preference for bromination is common in FDHs. We suggest that the higher electrophilicity of bromine relative to chlorine in heteroatom-X species,[77] such as the proposed hypohalous acid or haloamine halogenating agents in FDH catalysis, leads to more facile bromination. For example, the native chlorinase RebH can brominate a greater range of non-native substrates than it can chlorinate. Preference for bromination over chlorination for non-native as well as native substrates is also observed when both Cl⁻ and Br⁻ are present in solution. In competition reactions including both NaCl and NaBr, RebH prefers bromide over chloride for L-tryptophan, 1-phenylpiperazine, pindolol, and 2,4-dihydroxyacetophenone halogenation.[64] It is therefore possible that enzymes with higher bromination than chlorination scope in our high-throughput screen could nevertheless natively catalyze chlorination reactions.

Analyzing FDH Activity Using Sequence-Similarity Networks and Activity Clustering

Sequence-similarity networks provide an intuitive structure for exploring the protein sequence space of enzyme families. The FDH SSN contains Level 1 subnetworks comprising enzymes with similar native substrate preferences (indole vs phenol, etc.). At a more stringent identity threshold cutoff, Level 2 subnetworks with finer functional distinction are revealed. Within the Level 1 Phenol Subnetwork, for example, different Level 2 subnetworks containing primarily either variant A or variant B halogenases, which natively halogenate free small molecules or ACP-tethered substrates, respectively, are observed. The ability to distinguish such enzyme subclasses based on sequence alone is useful for focusing future genome mining efforts since our data indicate that neither of the variant B phenol halogenases examined was soluble. Information on site selectivity could also be obtained directly from sequence information in some cases. For example, within the Level 1 Indole Subnetwork, separation of tryptophan 5- and 7-halogenases into distinct Level 2 subnetworks was apparent, though tryptophan 6-halogenases were roughly evenly distributed between these subgroups. Only six of the Level 1 subnetworks (Figure A) examined contained enzymes with measurable chlorination or bromination activity on our probe substrate set, but these subnetworks contained 78% of the FDHs within the SSN. Specifically, enzymes in the Indole Subnetwork (66%), the Phenol Subnetwork (42%), the Pyrrole Subnetwork (2/5), subnetwork 4 (2/2), subnetwork 8 (1/2), and the MibH subnetwork (1/1) were active. These findings reflect the nature of the probe substrates chosen, but given the range of substrates examined and the similarity of these substrates to pharmaceuticals and other fine chemicals, they also highlight regions of FDH sequence space most likely to be of interest for biocatalysis. Active enzymes were found in most Level 2 subnetworks that comprise the Level 1 Indole Subnetwork (Figure B). 
For example, all enzymes in both tryptophan halogenase subnetworks were active, as was the only enzyme in subnetwork 21. Most enzymes in the BrvH halogenase subnetwork (84%), three out of four enzymes in subnetwork 2PYX, and two of three enzymes (1-C08 and 1-F11) in subnetwork 9 were active. The activity results within the Indole Subnetwork broadly show that a high fraction of these enzymes have potential as useful biocatalysts and highlight several underexplored regions in the FDH sequence space that merit further investigation. Analysis of Level 2 subnetworks within the Level 1 Phenol Subnetwork also highlights regions with high potential for biocatalyst identification. The majority of the tested enzymes in the variant A subnetwork, including 1-F05, were active (66%), but both of the variant B halogenases were insoluble under the conditions examined. The only soluble genome-mined enzyme in the large phenol halogenase subnetwork containing XanH was active. The single evaluated enzyme in the NapH2 subnetwork was inactive, as were the six soluble enzymes that were either singletons or within small (<15 members) subnetworks. Overall, the variant A subnetwork within the Phenol Subnetwork shows clear promise as a source of novel biocatalysts, but further study of other subnetworks would be required to get a clearer picture of their potential. Finally, functional characterization of enzymes across the FDH SSN demonstrated that enzymes within a Level 1 subnetwork have similar activity profiles on smaller probe substrates but that this trend diminishes on more complex substrates. Not surprisingly, more closely related enzymes possess more similar activity profiles. Highly similar substrate activity profiles are observed for RebH and 1-B12 (64% identical), both of which are within the Level 2 tryptophan 6,7-halogenase subnetwork, as well as for halogenases 1-H11 and 1-F10 (50% identical), both of which are in the Level 2 BrvH subnetwork. 
These trends suggest an approximate %ID threshold for future genome-mining of new halogenases with similar substrate scopes. Because the gene selection process of this study intentionally favored diverse sequences to maximize the breadth of our search for new halogenases, however, there are few instances of such similar enzyme pairs in which both were soluble and highly active. The average %ID for the most closely related enzyme within the genome-mined set was 41.2 ± 12.4%, perhaps too low for similarities in activity profiles among enzymes to result in consistent trends. More thorough genome mining of subnetworks with highly active FDHs could yield more concrete activity profiles and reveal more detailed information regarding enzyme substrate preferences.
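The relationship between Level 1 and Level 2 subnetworks described in this section can be illustrated with a small sketch: sequences are nodes, an edge joins any pair whose similarity meets a cutoff, and subnetworks are the connected components. Raising the cutoff splits a Level 1 component into finer Level 2 components. The Python below is a simplified stand-in, not the pipeline used in this study; it uses pairwise percent identity as the similarity score, and the sequence names, identity values, and both cutoffs are hypothetical.

```python
def subnetworks(names, identity, threshold):
    """Group sequences into connected components of a similarity network.

    `identity` maps (name_a, name_b) pairs to percent identity; pairs that are
    absent are treated as unconnected. An edge is drawn when the identity is
    at or above `threshold`.
    """
    parent = {name: name for name in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), pct in identity.items():
        if pct >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted((frozenset(g) for g in groups.values()), key=lambda g: sorted(g))


# Hypothetical sequences and pairwise percent identities (illustration only).
names = ["trp5_a", "trp5_b", "trp7_a", "trp7_b", "phe_a", "phe_b"]
identity = {
    ("trp5_a", "trp5_b"): 70,
    ("trp7_a", "trp7_b"): 65,
    ("trp5_a", "trp7_a"): 45,
    ("trp5_b", "trp7_b"): 42,
    ("phe_a", "phe_b"): 60,
    ("trp5_a", "phe_a"): 25,
    ("trp7_b", "phe_b"): 22,
}

level1 = subnetworks(names, identity, threshold=40)  # permissive cutoff
level2 = subnetworks(names, identity, threshold=55)  # more stringent cutoff

print("Level 1:", [sorted(g) for g in level1])
print("Level 2:", [sorted(g) for g in level2])
```

In this toy network the permissive cutoff groups all four tryptophan halogenase-like sequences together and keeps the phenol-like pair separate, while the stringent cutoff further separates the hypothetical 5- and 7-halogenase pairs. A cutoff chosen from pairs with similar activity profiles, such as the identity values discussed above, could in principle be used the same way to shortlist candidates for further genome mining.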

Unique Activity and Selectivity of Mined Halogenases

The selectivity of FDHs on their native substrates has driven interest in these enzymes as biocatalysts. The potential of FDHs has been explored by researchers seeking to extend their synthetic utility toward gram-scale synthesis,[24] more facile cross-coupling chemistry,[21−23] synthesis of enantioenriched products,[78,79] and diversification of natural product biosynthetic pathways.[80−82] Other work has sought to make operation more economical, including efforts toward improved cofactor regeneration.[24,83] Because a given FDH may not provide the selectivity required for a particular application, however, a number of laboratories have explored the use of targeted mutations to alter FDH selectivity. While grafting key residues from one tryptophan halogenase into another has been used to switch selectivity on tryptophan,[29] modest selectivity has generally been reported for efforts focused on non-native substrates (e.g., converting SttH from 90% 6-selective to 75% 5-selective for 3-indolepropionic acid chlorination).[30,84] To address this issue, our lab established that directed evolution can be used to generate FDHs with high (>90%) regioselectivity for different sites on a single substrate (tryptamine), and that the resulting enzymes also had altered selectivity on a range of other substrates.[28] Several rounds of evolution were required to achieve this result, so accelerating the identification of FDHs with complementary regioselectivity on non-native substrates remains an important goal. Gratifyingly, several enzymes identified in our family-wide survey of FDH activity exhibited regiocomplementarity on a number of structurally complex substrates. For example, the enzyme 2-C01 often provided different regiochemical outcomes as compared to other halogenases. 
This FDH is present in a subnetwork along with MibH, which natively chlorinates a tryptophan indole ring in a large lanthipeptide.[54] MibH has a large, hydrophobic binding pocket in order to accommodate its native substrate. The genome-mined halogenase 2-C01 may have a similar active site, which could accommodate large substrates in distinct binding poses. This finding suggests that 2-C01 could be a promising starting point for evolving FDHs to achieve late-stage functionalization of aryl C–H bonds with distinct regioselectivity relative to other FDHs characterized to date.

Comparison of the Genome-Mined Halogenase Library with Evolved Variants

Enzymes frequently require substantial modification before they are capable of being deployed as useful catalysts for organic synthesis. As the regiocomplementarity noted above illustrates, access to a diverse pool of enzymes that can serve as starting points for directed evolution can greatly expedite biocatalyst identification. While evolving a single enzyme can take a great deal of effort and may ultimately fail to provide the desired levels of improvement, a related enzyme may be better suited initially to the task and can drastically reduce the effort required to obtain a desired biocatalyst. This point can be retrospectively illustrated by several enzymes in our genome-mined set, which perform comparably to evolved RebH variants with increased thermal stability,[27] expanded substrate scope,[20] and altered regioselectivity.[28] Halogenase 1-F11 is notable in this regard. FDH biocatalysis is often hampered by low protein expression yields;[56] therefore, a more soluble starting enzyme for FDH directed evolution would be especially attractive. The expression yield of 1-F11 was 125 ± 30 mg/L from a 50 mL expression culture, higher than that of RebH, the expression of which yielded 54 ± 21 mg/L enzyme under analogous expression conditions (Figure 8A). Higher halogenase activity in lysate on numerous substrates is also observed for 1-F11 compared with RebH.[64] Despite originating from a mesophilic Sphingomonas species within an Arabidopsis thaliana root microbiome, 1-F11 has comparable thermal stability (Tm = 66.5 ± 0.2 °C) to RebH variant 3-LSR (Tm = 69.5 ± 0.4 °C, Figure 8B), which was evolved over three rounds of directed evolution for improved thermal stability.[27] Since more stable enzymes can have a longer catalytic lifetime and can better tolerate random mutations,[86] 1-F11 provides a convenient starting point for directed evolution. 
1-F11 also compares favorably in substrate scope with the RebH mutant 4V, which was evolved over four rounds of directed evolution for the late-stage C–H functionalization of yohimbine, a complex, biologically active molecule.[20] Because RebH has minimal activity on yohimbine, a substrate-walking directed evolution approach was required to evolve an enzyme that could halogenate this compound. Enzyme 1-F11, on the other hand, was capable of brominating yohimbine without any modification through directed evolution (Figure 8C). In short, 1-F11 possesses capabilities that previously required a total of seven rounds of directed evolution using two different approaches (three rounds for thermal stability and four for yohimbine halogenation), highlighting the benefits of family-wide genome mining for biocatalyst identification. Moreover, given the broad substrate scope of 1-F11, we envision it could be an ideal starting point for further directed evolution.

Figure 8

(A) Comparison of isolated RebH and 1-F11 protein yields after Ni-NTA purification from 50 mL expression cultures. (B) Comparison of CD thermal melts of RebH, thermostable RebH variant 3-LSR, and genome-mined halogenase 1-F11. Curves shown are the best fit for thermal unfolding monitored at 222 nm using CDPal.[85] (C) Wild-type RebH required several rounds of directed evolution before yohimbine halogenation was detectable. Halogenase 1-F11 can halogenate yohimbine without directed evolution (HPLC conversion shown).

Safety Statement

No uncommon safety risks were encountered while conducting the described research.

Conclusions

FDHs were first characterized in the mid-1990s. Since then, most of the FDHs reported have come from either studies on individual biosynthetic pathways or genome mining efforts targeting specific organisms or metagenomic samples.[19] Figure C and Figure E,F illustrate how these efforts have focused on a remarkably narrow range of FDH sequence space and missed out on large swaths of this space that contain functional enzymes, respectively. Family-wide activity analysis shows that similar fractions of FDHs from bacteria and fungi are soluble and active and that bromination is more commonly observed than chlorination. Broader sampling of this space has not only led to the identification of new enzymes with unique catalytic properties but also highlighted regions of sequence space that are ripe for further exploration. As noted above, other regions might also be suitable for different types of substrates than those examined herein, but for the electron-rich aromatic compounds explored to date, these regions are clearly privileged. Moreover, the SSNs and substrate activity profiles developed in this study offer a predictive ability for focusing biocatalyst selection or further genome mining efforts for particular applications. Extending this approach, involving SSN-guided selection of sequences from throughout an enzyme family, label-free high-throughput mass spectrometry screening using synthetic probe substrates, and activity profiling, to other enzymes has great potential for expediting biocatalyst identification. Beyond these family-wide findings, a number of remarkably useful enzymes were identified in the representative set that was explored. Particularly notable in this regard are 1-F11, 1-F08, 2-C01, 1-F05, and 1-B12. 
Collectively, these enzymes enable C–H halogenation of previously inaccessible substrates, provide complementary site selectivity on complex biologically active substrates, and exhibit improved thermostability relative to a commonly reported FDH. Their sequences also differ significantly from other FDHs that have been explored in vitro. With the exception of 1-B12, which is 64% identical to RebH, they are only 34–43% identical to FDHs that have been explored as biocatalysts. These novel and diverse halogenases therefore represent promising starting points for both directed evolution and additional genome mining aimed at identifying similarly effective biocatalysts. The activity of these enzymes on complex natural products and pharmaceuticals also suggests that their native substrates could be similarly fascinating structures. It could therefore be interesting to examine the native function of these enzymes, reversing the enzymology-to-biocatalysis progression that has dominated biocatalyst development to date.[4] This approach could provide a unique means of identifying new halogenated natural products and other unique compounds when extended to other enzyme classes.

75 in total

1. Characterization of tiacumicin B biosynthetic gene cluster affording diversified tiacumicin analogues and revealing a tailoring dihalogenase.

Authors: Yi Xiao; Sumei Li; Siwen Niu; Liang Ma; Guangtao Zhang; Haibo Zhang; Gaiyun Zhang; Jianhua Ju; Changsheng Zhang
Journal: J Am Chem Soc Date: 2010-12-27 Impact factor: 15.419

2. Late-Stage Diversification of Biologically Active Molecules via Chemoenzymatic C-H Functionalization.

Authors: Landon J Durak; James T Payne; Jared C Lewis
Journal: ACS Catal Date: 2016-01-25 Impact factor: 13.084

3. Expanding the Imine Reductase Toolbox by Exploring the Bacterial Protein-Sequence Space.

Authors: Dennis Wetzl; Marco Berrera; Nicolas Sandon; Dan Fishlock; Martin Ebeling; Michael Müller; Steven Hanlon; Beat Wirz; Hans Iding
Journal: Chembiochem Date: 2015-07-02 Impact factor: 3.164

Review 4. Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST): A web tool for generating protein sequence similarity networks.

Authors: John A Gerlt; Jason T Bouvier; Daniel B Davidson; Heidi J Imker; Boris Sadkhin; David R Slater; Katie L Whalen
Journal: Biochim Biophys Acta Date: 2015-04-18

5. Enantioselective Desymmetrization of Methylenedianilines via Enzyme-Catalyzed Remote Halogenation.

Authors: James T Payne; Paul H Butkovich; Yifan Gu; Kyle N Kunze; Hyun June Park; Duo-Sheng Wang; Jared C Lewis
Journal: J Am Chem Soc Date: 2018-01-08 Impact factor: 15.419

6. LodB is required for the recombinant synthesis of the quinoprotein L-lysine-ε-oxidase from Marinomonas mediterranea.

Authors: María Dolores Chacón-Verdú; Daniel Gómez; Francisco Solano; Patricia Lucas-Elío; Antonio Sánchez-Amat
Journal: Appl Microbiol Biotechnol Date: 2013-08-18 Impact factor: 4.813

7. Two Flavoenzymes Catalyze the Post-Translational Generation of 5-Chlorotryptophan and 2-Aminovinyl-Cysteine during NAI-107 Biosynthesis.

Authors: Manuel A Ortega; Dillon P Cogan; Subha Mukherjee; Neha Garg; Bo Li; Gabrielle N Thibodeaux; Sonia I Maffioli; Stefano Donadio; Margherita Sosio; Jerome Escano; Leif Smith; Satish K Nair; Wilfred A van der Donk
Journal: ACS Chem Biol Date: 2017-01-13 Impact factor: 5.100

8. Substrate activity screening: a fragment-based method for the rapid identification of nonpeptidic protease inhibitors.

Authors: Warren J L Wood; Andrew W Patterson; Hiroyuki Tsuruoka; Rishi K Jain; Jonathan A Ellman
Journal: J Am Chem Soc Date: 2005-11-09 Impact factor: 15.419

9. Understanding Flavin-Dependent Halogenase Reactivity via Substrate Activity Profiling.

Authors: Mary C Andorfer; Jonathan E Grob; Christine E Hajdin; Julia R Chael; Piro Siuti; Jeremiah Lilly; Kian L Tan; Jared C Lewis
Journal: ACS Catal Date: 2017-01-31 Impact factor: 13.084

10. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937
