
Discovery of novel carbohydrate-active enzymes through the rational exploration of the protein sequences space.

William Helbert, Laurent Poulet, Sophie Drouillard, Sophie Mathieu, Mélanie Loiodice, Marie Couturier, Vincent Lombard, Nicolas Terrapon, Jeremy Turchetto, Renaud Vincentelli, Bernard Henrissat.

Abstract

Over the last two decades, the number of gene/protein sequences gleaned from sequencing projects of individual genomes and environmental DNA has grown exponentially. Only a tiny fraction of these predicted proteins has been experimentally characterized, and the function of most proteins remains hypothetical or only predicted based on sequence similarity. Despite the development of postgenomic methods, such as transcriptomics, proteomics, and metabolomics, the assignment of function to protein sequences remains one of the main challenges in modern biology. As in all classes of proteins, the growing number of predicted carbohydrate-active enzymes (CAZymes) has not been accompanied by a systematic and accurate attribution of function. Taking advantage of the CAZy database, which groups CAZymes into families and subfamilies based on amino acid similarities, we recombinantly produced 564 proteins selected from subfamilies without any biochemically characterized representatives, from distant relatives of characterized enzymes and from nonclassified proteins that show little similarity with known CAZymes. Screening these proteins for activity on a wide collection of carbohydrate substrates led to the discovery of 13 CAZyme families (two of which were also discovered by others during the course of our work), revealed three previously unknown substrate specificities, and assigned a function to 25 subfamilies.
Copyright © 2019 the Author(s). Published by PNAS.

Keywords:  CAZymes; polysaccharides; screening

Year:  2019        PMID: 30850540      PMCID: PMC6442616          DOI: 10.1073/pnas.1815791116

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


The last 20 years have witnessed the sequencing of the genomes of isolated unicellular and pluricellular organisms as well as microbial communities from various environments, such as ocean (1, 2), soil (3), and the digestive tract of animals (4) and humans (5, 6). The current challenge is not to obtain even more sequence data, but rather to infer the function of the myriads of already identified proteins (7). Postgenomic approaches, such as transcriptomics, proteomics, and metabolomics, can reveal useful relationships between genes or proteins but do not directly assign function or substrate specificity to hypothetical proteins or enzymes. Therefore, despite the development of faster, cheaper, and miniaturized experimental methods, ascribing a function to a gene product remains the main challenge of biology in the postgenomic era (8). Reliable functional predictions are based on experimentally determined knowledge and on a suitable estimate of the divergence beyond which precise function cannot be readily extrapolated (9). Inspection of sequence databases shows that they are heavily polluted by erroneous functional predictions owing to the lack of universal similarity thresholds that can ensure robust propagation of protein function (8, 10, 11). This problem has become particularly acute with the emergence of bioinformatic methods that can detect extremely remote sequence similarities (12–14). The enzymes that assemble and deconstruct glycans have been classified into sequence-based families starting in 1991 (15–20). The functional diversity (specificity) of these enzymes is enormous and reflects the wide diversity of glycan structures found in nature. The database of carbohydrate-active enzymes (CAZymes), CAZy (www.cazy.org), compiles the various families of glycoside hydrolases (GHs), polysaccharide lyases (PLs), glycosyltransferases, and several other categories of enzymes that act on carbohydrates (21).
In the classification system that underlies the CAZy database, families are defined by sequences that cluster around at least one biochemically characterized member (21). Interestingly, the sequence-based CAZyme families often group together enzymes of differing substrate specificity (15), showing that the acquisition of novel substrate specificity is commonplace among CAZymes. However, as observed in general protein databases, a similarity search conducted against the entries in the CAZy database essentially yields uncharacterized or unreliably named gene products and thus fails to produce reliable functional inference. In addition, as in all protein databases, the number of entries in CAZy is increasing exponentially, but the number of biochemically characterized enzymes is growing much more slowly (21). For sequence-based functional predictions, the occurrence of enzymes of differing specificity in a given CAZy family results in a broad functional categorization, such as “putative glycoside hydrolase,” but does not provide a reliable prediction of the actual substrate of the enzyme. Furthermore, there are even examples of proteins that have evolved from CAZymes to acquire novel functions unrelated to their CAZyme ancestor (22). Multiple studies have shown that the breakdown of large multifunctional GH families into subfamilies yields a much narrower set of substrate specificities in each subfamily and offers a clear improvement in functional prediction for those subfamilies that have at least one characterized member (23–26). Conversely, subfamilies with no characterized members can guide enzymology investigations toward unexplored areas of the families. The steady increase in the number of CAZyme families over the last 20 years and the accumulation of unassigned sequences with similarity too low for reliable assignment to a family suggest that many other CAZyme families remain to be discovered.
The most direct route to ascribing a function to putative CAZymes involves demonstrating the actual cleavage of an oligosaccharide or polysaccharide substrate by the protein of interest. In this context, we assayed the degradation of a collection of substrates with a set of enzymes already classified into CAZyme families but assigned to subfamilies with no biochemically characterized member and with a set of highly remote GH and PL homologs too distant to allow their classification into any current CAZy subfamily or family. The strategy was based on a rational bioinformatic selection of targets, automatic gene synthesis, and screening of recombinant proteins on a wide diversity of carbohydrate substrates. This approach increased the number of biochemically characterized subfamilies and led to the discovery of several new enzyme families and of previously unreported substrate specificities.

Results

We selected 564 nucleotide sequences encoding potential glycan-cleaving enzymes from several families of the GH and PL classes of CAZymes. These sequences composed three broad sets of gene products. The first set (142 GHs and 13 PLs, approximately 28% of the investigated sequences) comprised sequences assigned to subfamilies of large GH and PL families with no characterized members. The second set (203 GHxx_dist and 19 PLxx_dist, approximately 39% of the investigated sequences) comprised sequences that fell outside of established subfamilies or were only distantly related to a particular family. The last set (187 candidates, approximately 33% of the investigated sequences) comprised protein sequences that could not be assigned to a family owing to insufficient similarity (<20% identity) with known GHs or PLs. These sequences were typically extracted from the nonclassified category of putative GHs and PLs (www.cazy.org/GH0.html and www.cazy.org/PL0.html). To preserve their native specificity, the sequences were not edited; all possible noncatalytic modules were left intact. The first two sets (subfamilies and distantly related proteins) included several eukaryotic sequences, whereas all sequences of candidate GH or PL proteins (the third set) were of prokaryotic origin. The complete list of sequences selected for this work, along with their source organism and family (and subfamily where possible), is given in ). All genes were codon-optimized for Escherichia coli expression, synthesized, and cloned in expression vectors encoding an N-terminal His tag for protein purification. Expression assays were conducted at microplate scale using an autoinducible medium. Automated purification using nickel-affinity chromatography revealed that approximately 60% of the recombinant proteins were obtained in a soluble state (Fig. 1).
We observed a significantly reduced number of soluble proteins of eukaryotic origin (9 of 33 soluble proteins) compared with bacterial and archaeal targets (323 of 506 proteins; hypergeometric test P < 4 × 10⁻⁵). No other significant correlation was found between solubility and a given taxonomic group (phylum, order, or family rank) or with CAZy families (at the subfamily or family level, with and without inclusion of the distant relatives in the families). Upscaling the cultures to 50-mL flasks to generate sufficient amounts for the screening experiments did not reveal any major shift in expression yield. Similarly, we did not observe any effect of the molecular mass of the proteins on expression yield, except for the largest proteins (>120 kDa), which are more likely than small proteins to be multimodular (Fig. 1). To increase the number of soluble targets of the study, we tested the effect of solubilizing tags (27). Thus, 24 genes coding for insoluble proteins were cloned with four of the most popular fusion partners: DsbC, thioredoxin, maltose-binding protein, and CpB (NZYTech). These experiments did not improve the yield of soluble proteins, suggesting that our initial strategy was efficient.
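The solubility comparison above can be re-derived from the quoted counts. The following is a minimal sketch, assuming a one-sided (lower-tail) hypergeometric test on the pooled 539 targets; the text does not state which tail or software was used, and the function name `hypergeom_lower_tail` is introduced here for illustration only.

```python
from math import comb


def hypergeom_lower_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X <= k) when drawing n items without replacement from a
    population of N items, K of which are 'successes' (here: soluble)."""
    lowest = max(0, n - (N - K))  # smallest possible number of successes
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lowest, k + 1))
    return favourable / comb(N, n)


# Counts quoted in the text
euk_soluble, euk_total = 9, 33        # eukaryotic targets
prok_soluble, prok_total = 323, 506   # bacterial and archaeal targets

N = euk_total + prok_total            # all targets considered in the comparison
K = euk_soluble + prok_soluble        # all soluble proteins

# Assumption: one-sided test for depletion of soluble proteins among eukaryotes
p_value = hypergeom_lower_tail(euk_soluble, N, K, euk_total)

print(f"overall soluble fraction: {K / N:.2f}")
print(f"P(X <= {euk_soluble}) = {p_value:.2e}")
```

The overall soluble fraction computed this way (332 of 539) is in line with the "approximately 60%" quoted above.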
Fig. 1.

Overexpression results. (A) Results presented according to three broad classes: (i) proteins from uncharacterized subfamilies within known families (GH/PL subfamilies), (ii) proteins classified into a CAZy family but only distantly related to characterized members (GH/PL distant), and (iii) remote homologs whose similarity is too low for inclusion in existing CAZy families (highly remote). “Soluble” refers to overexpressed proteins purified by nickel-affinity chromatography and detected using gel electrophoresis; “not observed,” to proteins that did not bind to any affinity column (e.g., inclusion bodies, misfolded proteins). (B) Absolute frequency of overexpressed enzymes according to their molecular mass.

Screening was conducted in 96-well microplates, in which the enzymes were incubated with a set of substrates distributed in the microwells. The enzyme activity was revealed using a colorimetric reducing assay and size exclusion chromatography. Because some of the substrates were rare, expensive, or difficult to purify, we took advantage of the CAZy classification to divide the set of substrates into subsets to streamline the screening procedure and minimize loss of substrates (). Members of a given family act on substrates whose glycosidic bonds have the same orientation regardless of the stereochemistry naming conventions (28). For instance, family GH39 contains both β-d-xylosidases and α-l-iduronidases, the substrates of which have an equatorial glycosidic bond (29). Therefore, the screening of enzymes classified into CAZyme families known to act on axially or equatorially linked glycosides was conducted on two sublibraries of substrates containing axial or equatorial glycosidic bonds, respectively (). In a similar vein, enzymes classified into PL families were screened against a set of substrates containing only hexuronides.
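The choice of a first-pass substrate sublibrary described here can be pictured as a simple lookup. The sketch below is illustrative only: the `Sublibrary` names, the `FAMILY_BOND_ORIENTATION` table, and the `initial_sublibrary` function are hypothetical, and only the GH39 entry (equatorial) is taken from the text. Sending proteins without a known family or orientation straight to the full substrate set is an assumption of this sketch.

```python
from enum import Enum
from typing import Optional


class Sublibrary(Enum):
    AXIAL = "substrates with axial glycosidic bonds"
    EQUATORIAL = "substrates with equatorial glycosidic bonds"
    HEXURONIDES = "hexuronide-containing substrates"
    ALL = "all available substrates"


# Hypothetical lookup table; only the GH39 entry is stated in the text.
FAMILY_BOND_ORIENTATION = {
    "GH39": Sublibrary.EQUATORIAL,
}


def initial_sublibrary(family: Optional[str]) -> Sublibrary:
    """Pick the first-pass sublibrary for a protein from its CAZy family."""
    if family is None:
        # Assumption: proteins without a family go to the full substrate set
        return Sublibrary.ALL
    if family.startswith("PL"):
        # PL families are screened on hexuronides only
        return Sublibrary.HEXURONIDES
    # GH families: use the known bond orientation, else the full set (assumption)
    return FAMILY_BOND_ORIENTATION.get(family, Sublibrary.ALL)


for fam in ("GH39", "PL7", None):
    print(fam, "->", initial_sublibrary(fam).value)
```

A table-driven lookup keeps the family-to-sublibrary knowledge in one place, so new families can be added without touching the routing logic.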
All proteins that did not cleave a substrate in their initially assigned sublibrary and all proteins from the most distant hypothetical sugar-cleaving enzyme category were tested on all available substrates. For the uncharacterized GH/PL subfamily set, a function was ascribed to 38 proteins classified into 25 distinct GH and PL subfamilies. The sequences selected for screening belonged to a small number of well-defined subfamilies with no characterized representative, found mostly in the GH5 and GH43 families (Table 1). The activities observed for the newly characterized subfamilies from the GH5 (e.g., β-mannanase, β-d-glucopyranosidase, β-d-galactofuranosidase) and GH43 (e.g., β-d-galactofuranosidase, α-l-arabinofuranosidase) families were consistent with previously characterized subfamilies from the same families. The glucuronan lyase and heparin lyase activities identified in the PL7_4 and PL15_2 subfamilies, respectively, represent newly described substrate specificities in the corresponding families, which previously included only alginate lyases. These new specificities demonstrate the polyspecificity of these poorly explored PL families.
Table 1.

Assignment of function to 25 subfamilies

| CAZy subfamily | GenBank accession no. | Substrate | Organism |
|---|---|---|---|
| GH5_13 | ZP_02065960.1 | pNP-β-d-galactofuranoside | Bacteroides ovatus ATCC 8483 |
| GH5_13 | WP_018627464.1 | pNP-α-l-arabinofuranoside | Niabella aurantiaca DSM 17617 |
| GH5_18 | ACU71175.1 | pNP-β-d-mannopyranoside | Catenulispora acidiphila DSM 44928 |
| GH5_35 | ACT02895.1 | Arabinoxylan | Paenibacillus sp. JDR-2 |
| GH5_40 | SCG47572.1 | Konjac glucomannan | Micromonospora rifamycinica DSM 44983 |
| GH5_41 | ABD80383.1 | β-mannan | Saccharophagus degradans 2–40 |
| GH5_43 | ADI04784.1 | pNP-β-d-glucopyranoside | Streptomyces bingchenggensis BCW-1 |
| GH5_45 | SDT09889.1 | pNP-α-l-arabinofuranoside (weak) | Azotobacter vinelandii DJ |
| GH5_45 | ACO76963.1 | pNP-β-d-glucopyranoside | Pseudomonas oryzae KCTC 32247 |
| GH13_38 | WP_029428030.1 | pNP-α-d-maltopyranoside | Bacteroides cellulosilyticus WH2 |
| GH13_38 | ABD79820.1 | pNP-α-d-maltopyranoside | Saccharophagus degradans 2–40 |
| GH30_6 | WP_028726386.1 | pNP-β-d-cellobioside | Parabacteroides gordonii DSM 23371 |
| GH43_2 | ACU61943.1 | pNP-α-l-arabinofuranoside | Chitinophaga pinensis DSM 2588 |
| GH43_2 | SDS19757.1 | pNP-α-l-arabinofuranoside | Mucilaginibacter mallensis MP1X4 |
| GH43_3 | WP_007211145.1 | pNP-β-d-galactofuranoside | Bacteroides cellulosilyticus WH2 |
| GH43_8 | EIY66405.1 | pNP-β-d-galactofuranoside | Bacteroides salyersiae CL02T12C01 |
| GH43_9 | AMX03466.1 | pNP-α-l-arabinofuranoside (weak) | Microbulbifer thermotolerans DAU221 |
| GH43_17 | ADQ05609.1 | pNP-α-l-arabinofuranoside | Caldicellulosiruptor owensensis OL |
| GH43_18 | WP_029328006.1 | pNP-α-l-arabinofuranoside | Bacteroides cellulosilyticus WH2 |
| GH43_18 | WP_029427512.1 | pNP-α-l-arabinofuranoside (weak) | Bacteroides cellulosilyticus WH2 |
| GH43_18 | WP_018628786.1 | pNP-α-l-arabinofuranoside (weak) | Niabella aurantiaca DSM 17617 |
| GH43_18 | AHF90946.1 | pNP-α-l-arabinofuranoside (weak) | Opitutaceae bacterium TAV5 |
| GH43_20 | SCF26596.1 | pNP-α-l-arabinofuranoside | Micromonospora echinospora DSM 43816 |
| GH43_20 | CBG71495.1 | pNP-α-l-arabinofuranoside | Streptomyces scabiei 87.22 |
| GH43_23 | ADO69162.1 | pNP-α-l-arabinofuranoside (weak) | Stigmatella aurantiaca DW4/3–1 |
| GH43_30 | SCG78792.1 | pNP-β-d-galactofuranoside | Stackebrandtia nassauensis DSM 44728 |
| GH43_30 | ADD39925.1 | pNP-β-d-galactofuranoside (weak) | Micromonospora siamensis DSM 45097 |
| GH43_31 | AFL85801.1 | pNP-β-d-galactofuranoside | Belliella baltica DSM 15883 |
| GH43_32 | ACB77177.1 | pNP-β-d-galactofuranoside (weak) | Opitutus terrae PB90-1 |
| GH43_32 | SDH69004.1 | pNP-β-d-galactofuranoside (weak) | Leifsonia sp. 197AMF |
| GH43_34 | WP_044096317.1 | pNP-α-l-arabinofuranoside | Bacteroides cellulosilyticus WH2 |
| GH43_34 | ZP_02066340.1 | pNP-β-d-galactofuranoside | Bacteroides ovatus ATCC 8483 |
| GH43_34 | ACS99115.1 | pNP-β-d-galactofuranoside | Paenibacillus sp. JDR-2 |
| GH43_37 | ADJ47124.1 | pNP-β-d-galactofuranoside (weak) | Amycolatopsis mediterranei U32 |
| PL7_4 | ACU70527.1 | β-glucuronan | Catenulispora acidiphila DSM 44928 |
| PL14_2 | AAC96919.1 | Alginate | Paramecium bursaria chlorella virus 1 |
| PL15_2 | ALJ58962.1 | Heparan sulfate | Bacteroides cellulosilyticus WH2 |

Enzyme activities (substrate specificities) were established using colorimetric and/or chromatography assays. The substrates used as well as the organism of origin of the protein are indicated. “Weak” indicates limited cleavage.

In the second set, comprising the distant relatives of established families of GHs and PLs (GHxx_dist and PLxx_dist), the success rate of substrate attribution was 23%, only one-half of that obtained with the set of proteins from well-defined subfamilies. Interestingly, however, in several cases, the enzyme activities ascribed to this distant relatives set corresponded to a new substrate specificity for the corresponding family (Table 2).
Table 2.

Activity of enzymes distantly related to the described GH or PL (GH/PLxx_dist) families

| Distant CAZy family | GenBank accession no. | Substrate | Organism |
|---|---|---|---|
| GH2_dist | WP_029427454.1 | pNP-β-d-xylopyranoside (new) | Bacteroides cellulosilyticus WH2 |
| GH2_dist | WP_029428707.1 | Tamarind gum (new) | Bacteroides cellulosilyticus WH2 |
| GH2_dist | WP_029428765.1 | pNP-β-d-glucuronide | Bacteroides cellulosilyticus WH2 |
| GH2_dist | WP_018628801.1 | pNP-β-d-glucuronide | Niabella aurantiaca DSM 17617 |
| GH3_dist | AJG33435.1 | pNP-β-d-N-acetyl-glucopyranoside | Rickettsia rickettsii str. R |
| GH5_dist | ZP_06241352.1 | pNP-β-d-mannopyranoside | Victivallis vadensis ATCC BAA-548 |
| GH10_dist | EMS72420.1 | pNP-β-d-xylopyranoside (weak) | Clostridium termitidis CT1112 |
| GH16_dist | ZP_02063674.1 | pNP-β-d-glucopyranoside (new) | Bacteroides ovatus ATCC 8483 |
| GH20_dist | AEV99795.1 | pNP-β-d-NAc-6Sulf-glucopyranoside | Niastella koreensis GR20-10 |
| GH20_dist | AHF94523.1 | pNP-β-d-NAc-glucopyranoside | Opitutaceae bacterium TAV5 |
| GH31_dist | EIY61740.1 | pNP-α-d-galactopyranoside | Bacteroides salyersiae CL02T12C01 |
| GH36_dist | EIY66649.1 | pNP-α-d-galactopyranoside | Bacteroides salyersiae CL02T12C01 |
| GH36_dist | ACS99969.1 | pNP-α-d-galactopyranoside | Paenibacillus sp. JDR-2 |
| GH36_dist | ACS99975.1 | pNP-α-d-galactopyranoside | Paenibacillus sp. JDR-2 |
| GH36_dist | ZP_06242255.1 | pNP-α-d-galactopyranoside | Victivallis vadensis ATCC BAA-548 |
| GH42_dist | EIY59668.1 | pNP-α-d-mannopyranoside | Bacteroides salyersiae CL02T12C01 |
| GH49_dist | EDY96541.1 | Chaetomorpha sp. CWP (new) | Bacteroides plebeius DSM 17135 |
| GH49_dist | EDY96565.1 | Chaetomorpha sp. CWP (new) | Bacteroides plebeius DSM 17135 |
| GH51_dist | WP_084555785.1 | Lichenan (new) | Alkaliflexus imshenetskii DSM 15055 |
| GH76_dist | ADO68190.1 | pNP-α-d-maltoside (new) | Stigmatella aurantiaca DW4/3–1 |
| GH106_dist | WP_018627535.1 | pNP-α-l-rhamnopyranoside | Niabella aurantiaca DSM 17617 |
| GH106_dist | ACT02314.1 | pNP-α-l-rhamnopyranoside | Paenibacillus sp. JDR-2 |
| GH117_dist | WP_010134686.1 | pNP-β-d-galactofuranoside | Flavobacteriaceae bacterium S85 |

This set encompasses enzymes that fall outside of established subfamilies or that are only distantly related to biochemically characterized enzymes. “New” designates novel specificity in the family. CWP, cell wall polysaccharide.

When a function could not be attributed to the GHxx_dist and PLxx_dist sequences using the sublibraries corresponding to known substrates of the cognate family, the proteins were screened on all substrates. By doing so, we found that a very distant relative of family PL9 (GenBank accession no. AEI51087.1) is not a PL, but rather a GH able to cleave the main chain of the exopolysaccharide (EPS) secreted by the ubiquitous cyanobacterium Nostoc commune. Therefore, this enzyme and its orthologs define a new GH family, GH160 (Table 3), which may share structural similarity with PL9 lyases. This is the first report of an enzyme able to degrade the EPS of Nostoc spp.
Table 3.

Substrate specificity of new CAZy families

| New family | GenBank accession no. | Substrate | Activity | Organism |
|---|---|---|---|---|
| GH147 | WP_029428318.1 | β-galactan | Endo-β-(1,4)-galactanase | Bacteroides cellulosilyticus WH2 |
| GH147 | EFI37897.1 | β-galactan | Endo-β-(1,4)-galactanase | Bacteroides sp. 3_1_23 |
| GH148 | AGN79260.1 | Konjac glucomannan | Endo-β-(1,4)-glucosidase | Pseudomonas putida H8234 |
| GH148 | ACR13278.1 | Konjac glucomannan | Endo-β-(1,4)-glucosidase | Teredinibacter turnerae T7901 |
| GH157 | WP_029429093.1 | CM-curdlan | Endo-β-glycosidase | Bacteroides cellulosilyticus WH2 |
| GH158 | ZP_06243608.1 | CM-curdlan | Endo-β-glycosidase | Victivallis vadensis ATCC BAA-548 |
| GH159 | WP_007210837.1 | pNP-β-d-galactofuranoside | β-d-galactosidase | Bacteroides cellulosilyticus WH2 |
| GH160 | AEI51087.1 | EPS Nostoc commune (new) | Endo-β-(1,4)-galactosidase | Runella slithyformis DSM 19594 |
| PL30 | WP_029426181.1 | Hyaluronan | Endo-hyaluronan lyase | Bacteroides cellulosilyticus WH2 |
| PL31 | ABD82242.1 | β-glucuronan | Endo-β-(1,4)-glucuronan lyase | Saccharophagus degradans 2-40 |
| PL31 | AGF62897.1 | β-glucuronan | Endo-β-(1,4)-glucuronan lyase | Streptomyces hygroscopicus subsp. jinggangensis TL01 |
| PL32 | EIY62149.1 | β-mannuronan | Endo-mannuronan lyase | Bacteroides salyersiae CL02T12C01 |
| PL33 | ALJ61728.1 | Hyaluronan | Endo-hyaluronan lyase | Bacteroides cellulosilyticus WH2 |
| PL33 | AHF90976.1 | Gellan (new) | Endo-gellan lyase | Opitutaceae bacterium TAV5 |
| PL33 | AHF90672.1 | Chondroitin sulfate | Endo-chondroitin sulfate lyase | Opitutaceae bacterium TAV5 |
| PL33 | AHF90411.1 | Gellan (new) | Endo-gellan lyase | Opitutaceae bacterium TAV5 |
| PL34 | AHF91913.1 | Alginate | Endo-alginate lyase | Opitutaceae bacterium TAV5 |
| PL35 | ZP_06241351.1 | Chondroitin | Endo-chondroitin lyase | Victivallis vadensis ATCC BAA-548 |
| PL36 | WP_084332190.1 | β-mannuronan | Endo-mannuronan lyase | Flavobacterium denitrificans DSM 15936 |

The substrate and the modality of substrate degradation are specified. “New” designates novel specificity not reported previously. Note that families GH147 and 148 were reported by other groups during the course of our work (30, 31). CM, carboxymethyl.

The probability of ascribing a function to the most distant hypothetical sugar-cleaving enzymes (third set) was not expected to be very high; however, we validated GH or PL activities for approximately 18% (19 enzymes) of the 104 soluble proteins screened in this category (Table 3). These enzymes show extremely high divergence from enzymes grouped in known CAZyme families, and thus were identified as the first representatives of six new GH families and seven new PL families. Using chromatographic and NMR methods, we performed a thorough analysis of the reaction products of the four most original enzyme activities (three of which were not previously reported in any CAZy family) that we discovered during the course of our work. , respectively report the characterization of the end products of gellan lyase on gellan (founding member of PL33), of an enzyme able to cleave the polysaccharide secreted by Nostoc spp. (a founding member of GH160), of a galactanase activity (previously unreported in GH147), and of an endo-acting sulfated-arabinan hydrolase (previously unreported in GH49). In some cases, multiple representatives of the new families were characterized. The newly established PL family (PL33) was clearly polyspecific and grouped together gellan lyase, chondroitin sulfate lyase, and hyaluronan lyase. Two of our 13 new families (GH147 and GH148) were reported by others during the course of our work (30, 31). Although this decreases the number of newly described families from 13 to 11, it confirms that our approach is able to uncover families that were discovered using other approaches.
Interestingly, our work revealed enzyme activities in families GH147 and GH148 different from those reported elsewhere, again demonstrating that our approach is valid for enzyme discovery. The characteristics of the new families reported here are summarized in .

Discussion

The selection of our targets was based on exploration of the uncharacterized branches of CAZyme family trees, that is, uncharacterized subfamilies, distant relatives of families (GH/PLxx_dist), or highly divergent proteins (GH/PL_nc). Thus, for the first time, a function was attributed to representatives of 25 well-defined subfamilies of the 48 subfamilies initially targeted. A variety of substrate activities have been previously described in the large GH5 and GH43 families (25, 26), which facilitated our investigation due to the expectation that the uncharacterized subfamilies would share a common activity with previously studied ones. This was particularly true in the case of family GH43, for which 14 of the 18 targeted subfamilies displayed α-l-arabinofuranosidase or β-d-galactofuranosidase activity, as has been observed in many previously described GH43 subfamilies. None of the GH43 targets that we produced exhibited activity against sugar beet arabinan or larchwood arabinogalactan, and the GH43 enzyme activity was recorded only on synthetic para-nitrophenyl (pNP)-glycoside substrates. Previous work has shown that the actual substrate of arabinofuranosidases can arise from the sequential action of other specific enzymes during action on complex glycans, such as arabinoxylan, arabinan, and arabinogalactan (30, 32, 33); however, such partially degraded substrates are often not readily available. Thus, it is possible that some differences may emerge between GH43 subfamilies when assaying the enzymes against complex substrates, as discussed by Mewis et al. (26). In only 7 of the 17 targeted GH5 subfamilies could the function be assigned, most likely due to the large number of eukaryotic targets selected in this family, resulting in a low yield of soluble proteins (16 of 50 soluble proteins in GH5 targets, compared with 66 of 102 soluble proteins in other families; hypergeometric test P < 10⁻⁴).
Seven different substrates (pNP-β-d-galactofuranoside, pNP-α-l-arabinofuranoside, pNP-β-d-mannopyranoside, arabinoxylan, konjac glucomannan, β-mannan, and pNP-β-d-glucopyranoside) were needed to characterize the seven GH5 subfamilies, in agreement with the high polyspecificity already reported for the GH5 family (25). The assignment of function to distant GH and PL (GH/PLxx_dist) proteins was more challenging but was also a source of discovery. Seven of the 23 GH/PLxx_dist proteins characterized were active on a substrate that had not been previously reported in the corresponding family. For example, the endo-β(1,4)-glucanase activity of a GH2_dist protein (GenBank accession no. WP_029428707.1), revealed by the degradation of tamarind gum (xyloglucan), had not been previously observed in family GH2. Similarly, another GH2_dist protein (GenBank accession no. WP_029427454.1) displayed a β-d-xylosidase activity not previously reported in family GH2. Interestingly, family GH2 was created in 1991 and has been the subject of numerous biochemical investigations. Therefore, our results demonstrate that polyspecificity remains underestimated even for such well-established GH families, with a direct impact on functional inference from sequence data only. Even more unexpected was the finding that the two distant relatives of the GH49 family (GenBank accession nos. EDY96541.1 and EDY96565.1) can cleave a cell wall polysaccharide from the green algae Chaetomorpha spp. and Cladophora spp., whose backbone is composed of sulfated arabinan (34), a structure highly dissimilar to dextran and pullulan, previously known as the sole substrates of family GH49 enzymes. The results of NMR analysis of the reaction products of GenBank EDY96541.1 are presented in . The rationale for selecting the most distant hypothetical sugar-cleaving enzymes category was to explore the frontiers of the CAZy families so divergent that bioinformatic methods failed to predict putative functions.
Functional screening of the proteins of this category led to assignment of the function of enzymes that are the founding members of 13 new families, 2 of which were described by others during the course of our work. From the establishment of the first 35 GH families in 1991 (15) to the 156 families described to date (for a continuously updated classification, see www.cazy.org), an average of approximately 5 new GH families are created each year. The number of PL families is lower because this class of enzymes is specific to polyuronic acid substrates; starting with 9 PL families in 1999 and reaching 29 to date, the number of PL families has grown at a rate of approximately 1 new family per year. Therefore, an average of six new GH and PL families are described each year. Here we have identified roughly twice the number of new families reported worldwide per year. Our substrate screening strategy for the proteins having very low homology with known enzymes has proven to be efficient for identifying novel candidate GHs and PLs. In a virtuous circle, the novel families now define new frontiers to be explored. This method can now be extended to new sets of hypothetical sugar-cleaving enzymes. We have explored a portion of the large amount of sequence data rationally grouped and classified in the CAZy database. To continue exploring the diversity of sugar-cleaving enzymes, the production of several thousands of recombinant GHs and PLs is now technically possible (35) and is limited only by the cost of gene synthesis which, fortunately, continues to decrease. Thus, the main bottleneck for functional assignment likely is not protein production, but rather the availability of a large and diverse array of substrates. 
Although this was not a major problem for the screening of enzymes classified in subfamilies of the GH5 and GH43 families, the assignment of function to distantly related enzymes (GHxx_dist and PLxx_dist) and the most distant hypothetical sugar-cleaving enzymes depended directly on the diversity of substrates in the screening library. Thus, the discovery of the first gellan lyase, the first N. commune EPS hydrolase, and the first cladophoran hydrolases was possible only because the respective substrates were present in our glycan library. Significantly, the function of more than 243 soluble proteins produced during this work could not be identified, presumably due to the lack of suitable substrates, representing a large and untapped potential for subsequent discoveries.
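The family-creation rates quoted in the Discussion follow from simple arithmetic on the family counts. The sketch below reproduces that reasoning; the end year (`END_YEAR = 2018`) is an assumption standing in for "to date", which the text does not specify, and the helper name `families_per_year` is hypothetical.

```python
def families_per_year(start_count: int, end_count: int,
                      start_year: int, end_year: int) -> float:
    """Average number of new families created per year over the interval."""
    return (end_count - start_count) / (end_year - start_year)


END_YEAR = 2018  # assumption: "to date" is not given as a year in the text

# Counts and start years quoted in the Discussion
gh_rate = families_per_year(35, 156, 1991, END_YEAR)   # GH families
pl_rate = families_per_year(9, 29, 1999, END_YEAR)     # PL families
combined_rate = gh_rate + pl_rate

# Families newly described in this work (13 found, 2 reported by others first)
new_families_here = 13 - 2
ratio = new_families_here / combined_rate

print(f"GH: {gh_rate:.1f}/year, PL: {pl_rate:.1f}/year, combined: {combined_rate:.1f}/year")
print(f"families described here relative to one year worldwide: {ratio:.1f}x")
```

Under this assumption, the combined rate comes out between five and six families per year, and the 11 families described here correspond to roughly two years of worldwide discovery, in line with the rounded figures in the text.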

Conclusion

We have shown that it is possible to ascribe the function of putative enzymes distantly related to experimentally characterized GHs and PLs through a systematic exploration of the sequence space coupled with a screening procedure against a collection of diverse carbohydrate substrates. The effectiveness of this strategy is illustrated by the description of 11 new families, the discovery of three new substrate specificities, and the assignment of function to 25 subfamilies, starting from a set of 564 bioinformatically selected proteins. A similar approach conducted on thousands of targets would not only generate more discoveries, but also enable a more reliable, knowledge-based functional prediction for gene products from genomic or metagenomic sequencing projects. Given the decreasing cost of recombinant protein production, the main remaining bottleneck is the availability of a substrate library that parallels the diversity of the glycan structures found in nature.

Materials and Methods

Bioinformatics: Selection of Targets.

The daily updates of the CAZy database rely on the careful analysis of newly released protein sequences from GenBank by comparing them with previously analyzed/stored sequences (21, 36). To obtain accurate annotation, our procedures make use of sequence libraries of varying levels of granularity: subfamilies, families, and remote relatives. In this work, targets were drawn from three categories: “uncharacterized subfamilies,” “distant members within families,” and “hypothetical sugar-cleaving enzymes.” Details of the selection process are provided in .

Screening Experiments.

For this study, E. coli codon optimization, gene synthesis, and cloning of the 564 targets were outsourced to NZYTech. High-throughput expression and purification assays were conducted following the protocol described by Saez and Vincentelli (27). The soluble proteins were screened against the collection of substrates according to the method developed by Fer et al. (37). All positive hits were produced at least twice, and the most interesting enzymes were fully biochemically characterized. The protocol is described in detail in .
  37 in total

1.  Practical limits of function prediction.

Authors:  D Devos; A Valencia
Journal:  Proteins       Date:  2000-10-01

2.  Why are there so many carbohydrate-active enzyme-related genes in plants?

Authors:  Pedro M Coutinho; Mark Stam; Eric Blanc; Bernard Henrissat
Journal:  Trends Plant Sci       Date:  2003-12       Impact factor: 18.313

3.  A classification of glycosyl hydrolases based on amino acid sequence similarities.

Authors:  B Henrissat
Journal:  Biochem J       Date:  1991-12-01       Impact factor: 3.857

4.  Updating the sequence-based classification of glycosyl hydrolases.

Authors:  B Henrissat; A Bairoch
Journal:  Biochem J       Date:  1996-06-01       Impact factor: 3.857

5.  Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans.

Authors:  Brian D Muegge; Justin Kuczynski; Dan Knights; Jose C Clemente; Antonio González; Luigi Fontana; Bernard Henrissat; Rob Knight; Jeffrey I Gordon
Journal:  Science       Date:  2011-05-20       Impact factor: 47.728

Review 6.  'Unknown' proteins and 'orphan' enzymes: the missing half of the engineering parts list--and how to find it.

Authors:  Andrew D Hanson; Anne Pribat; Jeffrey C Waller; Valérie de Crécy-Lagard
Journal:  Biochem J       Date:  2009-12-14       Impact factor: 3.857

7.  Environmental genome shotgun sequencing of the Sargasso Sea.

Authors:  J Craig Venter; Karin Remington; John F Heidelberg; Aaron L Halpern; Doug Rusch; Jonathan A Eisen; Dongying Wu; Ian Paulsen; Karen E Nelson; William Nelson; Derrick E Fouts; Samuel Levy; Anthony H Knap; Michael W Lomas; Ken Nealson; Owen White; Jeremy Peterson; Jeff Hoffman; Rachel Parsons; Holly Baden-Tillson; Cynthia Pfannkoch; Yu-Hui Rogers; Hamilton O Smith
Journal:  Science       Date:  2004-03-04       Impact factor: 47.728

8.  A framework for human microbiome research.

Authors: 
Journal:  Nature       Date:  2012-06-13       Impact factor: 49.962

Review 9.  COMBREX: COMputational BRidge to EXperiments.

Authors:  Richard J Roberts
Journal:  Biochem Soc Trans       Date:  2011-04       Impact factor: 5.407

10.  Dietary pectic glycans are degraded by coordinated enzyme pathways in human colonic Bacteroides.

Authors:  Ana S Luis; Jonathon Briggs; Xiaoyang Zhang; Benjamin Farnell; Didier Ndeh; Aurore Labourel; Arnaud Baslé; Alan Cartmell; Nicolas Terrapon; Katherine Stott; Elisabeth C Lowe; Richard McLean; Kaitlyn Shearer; Julia Schückel; Immacolata Venditto; Marie-Christine Ralet; Bernard Henrissat; Eric C Martens; Steven C Mosimann; D Wade Abbott; Harry J Gilbert
Journal:  Nat Microbiol       Date:  2017-12-18       Impact factor: 17.745
