Jaya Srivastava1, P Sunthar2, Petety V Balaji1. 1. Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India. 2. Department of Chemical Engineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India.
Abstract
Several monosaccharides constitute naturally occurring glycans, but it is uncertain whether they constitute a universal set like the alphabets of proteins and DNA. Based on the available experimental observations, it is hypothesized herein that the glycan alphabet is not universal. Data on the presence/absence of pathways for the biosynthesis of 55 monosaccharides in 12 939 completely sequenced archaeal and bacterial genomes are presented in support of this hypothesis. Pathways were identified by searching for homologues of biosynthesis pathway enzymes. Substantial variations were observed in the set of monosaccharides used by organisms belonging to the same phylum, genera and even species. Monosaccharides were grouped as common, less common and rare based on their prevalence in Archaea and Bacteria. It was observed that fewer enzymes are sufficient to biosynthesize monosaccharides in the common group. It appears that the common group originated before the formation of the three domains of life. In contrast, the rare group is confined to a few species in a few phyla, suggesting that these monosaccharides evolved much later. Fold conservation, as observed in aminotransferases and SDR (short-chain dehydrogenase reductase) superfamily members involved in monosaccharide biosynthesis, suggests neo- and sub-functionalization of genes led to the formation of the rare group monosaccharides. The non-universality of the glycan alphabet begets questions about the role of different monosaccharides in determining an organism's fitness.
Several monosaccharides constitute naturally occurring glycans, but it is uncertain whether they constitute a universal set like the alphabets of proteins and DNA. Based on the available experimental observations, it is hypothesized herein that the glycan alphabet is not universal. Data on the presence/absence of pathways for the biosynthesis of 55 monosaccharides in 12 939 completely sequenced archaeal and bacterial genomes are presented in support of this hypothesis. Pathways were identified by searching for homologues of biosynthesis pathway enzymes. Substantial variations were observed in the set of monosaccharides used by organisms belonging to the same phylum, genera and even species. Monosaccharides were grouped as common, less common and rare based on their prevalence in Archaea and Bacteria. It was observed that fewer enzymes are sufficient to biosynthesize monosaccharides in the common group. It appears that the common group originated before the formation of the three domains of life. In contrast, the rare group is confined to a few species in a few phyla, suggesting that these monosaccharides evolved much later. Fold conservation, as observed in aminotransferases and SDR (short-chain dehydrogenase reductase) superfamily members involved in monosaccharide biosynthesis, suggests neo- and sub-functionalization of genes led to the formation of the rare group monosaccharides. The non-universality of the glycan alphabet begets questions about the role of different monosaccharides in determining an organism's fitness.
Entities:
Keywords:
bioinformatics; data mining; glycobiology
The curated set of proteins used in this study, with domain assignments, is listed in the Supplementary Excel file (supplementary_data.xlsx), available with the online version of this article. The corresponding 396 references with evidence of experimental characterization are included in the Supplementary Material. The results of the genome scan, which include predictions of monosaccharides as well as the biosynthesis pathway enzymes, is available at http://www.bio.iitb.ac.in/glycopathdb/. Python script and associated files used in this manuscript can be found here: https://github.com/jayaasrivastava/GlycopathDBCarbohydrates, nucleic acids and proteins are important classes of biological macromolecules. The universality of DNA, RNA and protein alphabets has been established beyond doubt. However, the universality of the glycan alphabet is unknown, primarily because of the challenges associated with the elucidation of glycan structures. This has precluded a comprehensive investigation of the glycan alphabet. To address this challenge, we have identified the prevalence of 55 monosaccharide biosynthesis pathways in 12 939 completely sequenced archaeal and bacterial genomes by searching for homologues of biosynthesis pathway enzymes using hidden Markov model profiles, and in a few cases blastp. This revealed that the glycan alphabet is highly variable; in fact, significant differences are found even among different strains of a species. Possible implications of this variability may be significant in understanding the evolution of Archaea and Bacteria in diverse and competitive environments. Factors that drive the choice of monosaccharides used by an organism need to be investigated, and will be of interest in understanding host–pathogen interactions. Additionally, the knowledge of the glycan alphabet can be employed for structural characterization/validation of glycans inferred using MS. Knowledge of unique monosaccharides and biosynthetic enzymes can also be used to identify novel drug targets against human pathogens.
Introduction
Living organisms show enormous diversity in organization, size, morphology, habitat, etc., but are unified by the highly conserved processes of central dogma: replication, transcription and translation. The enormous diversity seen in life forms is encoded by DNA and decoded primarily by proteins. Both DNA and proteins use the same set of building blocks (nucleotide bases and amino acids, respectively) in all organisms; yet, they store the requisite information by merely varying (i) the set/subset of building blocks used, (ii) the number of times each building block is used and (iii) the sequence in which the building blocks are linked (collectively referred to as the ‘sequence’) (Table 1). The information required for several other biological processes are stored by glycans, the third group of biological macromolecules [1]. It has been found that glycans evolve rapidly in response to changing environmental conditions, especially in Bacteria and, thus, contribute to organismal diversity [2, 3]. The question is, do glycans use the same set of building blocks (viz. monosaccharides) in all organisms, the way proteins and nucleic acids do?
Table 1.
Sources of diversity in primary structures of DNA, proteins and glycans
Feature
DNA
Protein
Glycan
Structural diversity of building blocks
Low (four nucleotides).
Nucleotide modifications (known but rare): N7-methylation of Ade/Gua.
Higher, relative to DNA (20 amino acids). Has structurally similar pairs: Asp/Glu, Asn/Gln, Phe/Tyr, Leu/Ile/Val.
Amino acid modifications (known but rare): hydroxylation of Pro and Lys, selenocysteine, pyrrolysine.
Highest. Several pentoses and hexoses, many of which are configurational isomers.
Pyranose and furanose forms (e.g. Gal).
Both enantiomeric forms (e.g. Gal).
Modifications extremely common (deoxy, uronic acid, deoxyamino and its derivatives, acetylation, sulfation, etc.)
Linkage
3′,5′-Phosphodiester. 5′,5′-Phosphodiester occurs but very rare.
Amide bond.
γ-COOH of Glu and ε-NH2 group of Lys used but very rare.
Alternative isomeric linkages are very common (α1→3, β1→3, α1→6, β1→4, α2→3, α2→6 and so on).
Sequence
Set/subset of building blocks used.
Number of times each building block is used.
Sequence in which the building blocks are linked.
Branching
Absent
Absent
Quite common
Sequence repeat heterogeneity
Present
Present
Present
Microheterogeneity*
Absent
Absent
Present
*Microheterogeneity refers to the presence of multiple forms of glycans (with minor but distinct variations) present in different molecules of a protein synthesized by a cell at the ‘same’ time. This feature is unique to glycans, just as the presence of splice variants is unique to proteins.
Sources of diversity in primary structures of DNA, proteins and glycansFeatureDNAProteinGlycanStructural diversity of building blocksLow (four nucleotides).Nucleotide modifications (known but rare): N7-methylation of Ade/Gua.Higher, relative to DNA (20 amino acids). Has structurally similar pairs: Asp/Glu, Asn/Gln, Phe/Tyr, Leu/Ile/Val.Amino acid modifications (known but rare): hydroxylation of Pro and Lys, selenocysteine, pyrrolysine.Highest. Several pentoses and hexoses, many of which are configurational isomers.Pyranose and furanose forms (e.g. Gal).Both enantiomeric forms (e.g. Gal).Modifications extremely common (deoxy, uronic acid, deoxyamino and its derivatives, acetylation, sulfation, etc.)Linkage3′,5′-Phosphodiester. 5′,5′-Phosphodiester occurs but very rare.Amide bond.γ-COOH of Glu and ε-NH2 group of Lys used but very rare.Alternative isomeric linkages are very common (α1→3, β1→3, α1→6, β1→4, α2→3, α2→6 and so on).SequenceSet/subset of building blocks used.Number of times each building block is used.Sequence in which the building blocks are linked.BranchingAbsentAbsentQuite commonSequence repeat heterogeneityPresentPresentPresentMicroheterogeneity*AbsentAbsentPresent*Microheterogeneity refers to the presence of multiple forms of glycans (with minor but distinct variations) present in different molecules of a protein synthesized by a cell at the ‘same’ time. This feature is unique to glycans, just as the presence of splice variants is unique to proteins.Monosaccharides show a lot more structural variation than amino acids in terms of the enantiomeric forms (both d and l), size (five to nine carbon atoms), ring type (pyranose, furanose), and type and extent of modification (deoxy, amino, N-formyl, N-acetyl, etc.). Some pairs of monosaccharides differ from each other merely in the configuration of carbon atoms. The sequence (as defined above) of monosaccharides brings about diversity even in the primary structure of glycans. DNA and protein are linear polymers and the linkage type that connects monomers remains the same throughout. In contrast, glycans can be branched and have alternative isomeric linkages (e.g. α1→3, β1→4, α2→6 and so on) [4], two features that enhance diversity in glycans. Repeat length heterogeneity (the number of occurrences of a sequence repeat) is observed in glycans [5, 6], as well as DNA and proteins, although there are no data on the frequency of occurrence of this feature in these three classes of biomolecules. An additional factor that contributes to the diversity in the primary structure of glycans is microheterogeneity [7], a feature not seen in DNA or proteins (Table 1). These structural variations demand the use of multiple analytical techniques for sequencing; hence, there are no automated methods for sequencing glycans. Biosynthesis of DNA and proteins is template-driven, but not that of glycans. Consequently, there is no equivalent of PCR or recombinant protein expression to ‘amplify’ glycans to obtain samples in amounts required for structural/functional analysis. These constraints have largely limited data on glycan sequences.Monosaccharides are viewed as the third alphabet of life [8]. How large is this alphabet? The number of monosaccharides used collectively by living systems is at least 60. An analysis of the bacterial glycan structural data showed a distinct difference in the set of monosaccharides used by bacteria and mammals [9]. Is this difference evidence of absence, i.e. monosaccharides found in databases are true representations of monosaccharides used by these organisms, and those not found are not used by organisms? Or is it just absence of evidence, i.e. the glycan alphabet is indeed universal and the observed differences are merely due to inadequate sequencing? With the availability of the whole-genome sequences of a large number of organisms, it has now become possible to resolve this issue.In this study, it is hypothesized that the glycan alphabet is NOT universal, i.e. different organisms use different sets of monosaccharides. This is in contrast to DNA, RNA and proteins. This hypothesis is put forward based on the observations that >60 monosaccharides are found in living systems, the database of glycan structures shows differential usage of monosaccharides and several serotypes differ from each other in the monosaccharides they use. Results obtained by mining whole-genome sequences of 303 Archaea and 12 636 Bacteria are presented herein in support of this hypothesis. Monosaccharides considered in this study are nucleotide-activated moieties that are utilized by glycosyltransferases (GTs) in the biosynthesis of glycans. Subsequent to such a GT-catalysed transfer, monosaccharides may be modified (e.g. O-acetylation). Monosaccharide derivatives so obtained are not considered in the present study. Enzymes catalysing one or more steps of the biosynthesis pathway are not characterized experimentally for some of the monosaccharides. Such monosaccharides were not considered in this study.
Methods
Databases and software
Protein sequences and 3D structures were obtained from UniProt and PDB (Table S1) databases. Completely sequenced genomes of 303 Archaea and 12 636 Bacteria were obtained from the National Center for Biotechnology Information (NCBI) RefSeq database. These genomes are spread across 3384 species belonging to 1194 genera (Fig. S1). Gene neighbourhood was analysed using feature tables taken from NCBI for the respective genomes. blastp, muscle, hmmer and CD-Hit (Table S1) were installed and used locally. Default values were used for all parameters except when stated otherwise. Word size was set to two for blastp to prioritize global alignments over local alignments. Thresholds for hidden Markov model (HMM) profiles were set based on the best one domain bit score rather than E values, since the former is independent of database size.
Searching genomes for monosaccharide biosynthesis pathways
Pathways for the biosynthesis of 55 monosaccharides have been elucidated to date (Table 2, Fig. S2a–g). HMM profiles were generated using carefully curated sets of homologues for 57 families of enzymes that catalyse various steps of the biosynthesis of the 55 monosaccharides (Supplementary Excel file – Supplementary_data.xlsx: worksheet1). Sequences were used directly as blastp queries when the number of enzymes characterized experimentally was not sufficient for an HMM profile (Supplementary_data.xlsx: worksheet2). In-house Python scripts were used to scan genomes to identify homologues. The presence of a homologue for each and every enzyme of the biosynthetic pathway of a monosaccharide was taken as evidence of the utilization of this monosaccharide by the organism. However, absence of a homologue for even one enzyme of the pathway was interpreted as the absence of the corresponding monosaccharide from the organism’s glycan alphabet.
Table 2.
Summary of the pathways for the biosynthesis of monosaccharides
The monosaccharide l-iduronic acid has not been considered in this study, since there is no separate pathway for its biosynthesis. Dermatan sulfate epimerase-1 or -2 (DS-epi1 or DS-epi2) catalyses C5-epimerization of glucuronic acid to l-iduronic acid in chondroitin sulfate polymeric chains [39]. Enzymes catalysing one or more steps of the biosynthesis pathway are not characterized experimentally for some of the monosaccharides. Such monosaccharides were not considered in this study.
Details about the end product of biosynthesis pathway
Precursor*
Glc-1-P
Fruf-6-P
GDP-Man
UDP-Glc2NAc
Glc2NAc-1-P
Sed-7-P
No. of nucleotide sugars†
27
2
8
16
1
4
No. of monosaccharides‡
25
2
8
16
1
4
No. of monosaccharides with different numbers of backbone carbon atoms
Pentose
4
–
–
–
–
–
Hexose
21
2
8
13
–
–
Heptulose
–
–
–
–
–
4
Nonulose
–
–
–
3
1
–
No. of monosaccharides of the two enantiomeric forms§
d
19
2
5
12
1
3
l
6
–
3
4
–
1
No. of monosaccharides of the two ring forms
Pyranose
23
2
8
16
1
4
Furanose
2
–
–
–
–
–
No. of monosaccharides with different nucleotides
ADP
–
–
–
–
–
1
CDP
7
–
–
–
–
–
CMP
–
–
–
3
1
–
GDP
–
1
8
–
–
3
TDP/dTDP||
9
–
–
–
–
–
UDP
11
1
–
13
–
–
*Glc-6-P is the precursor for Glc-1-P (conversion catalysed by phosphoglucomutase), Fruf-6-P (catalysed by phosphoglucose isomerase) and Sed-7-P (formed in the non-oxidative phase of the pentose phosphate pathway). Fruf-6-P is the precursor of GDP-Man and UDP-Glc2NAc.
†There are two pathways for the biosynthesis of CMP-Leg5Ac7Ac, one starting from UDP-Glc2NAc and the other from Glc2NAc-1-P. Hence, the total number of nucleotide sugars will be 57 even though the row sum is 58.
‡l-Rhamnose and Qui4NAc are biosynthesized as both UDP- and TDP-/dTDP-derivatives. Hence, the number of monosaccharides is less than the number of nucleotide sugars by 2.
§The prefix d is omitted for d enantiomers whereas the prefix l is explicitly mentioned for l enantiomers.
||No distinction is made between TDP and dTDP in this work, since the literature suggests that both ribo- and deoxyribo-substrates are used by enzymes, albeit with varying extents of specificity depending upon the source organism. In fact, dTDP and TDP have been used synonymously by some authors.
Summary of the pathways for the biosynthesis of monosaccharidesThe monosaccharidel-iduronic acid has not been considered in this study, since there is no separate pathway for its biosynthesis. Dermatan sulfate epimerase-1 or -2 (DS-epi1 or DS-epi2) catalyses C5-epimerization of glucuronic acid to l-iduronic acid in chondroitin sulfate polymeric chains [39]. Enzymes catalysing one or more steps of the biosynthesis pathway are not characterized experimentally for some of the monosaccharides. Such monosaccharides were not considered in this study.Details about the end product of biosynthesis pathwayPrecursor*Glc-1-PFruf-6-PGDP-ManUDP-Glc2NAcGlc2NAc-1-PSed-7-PNo. of nucleotide sugars†27281614No. of monosaccharides‡25281614No. of monosaccharides with different numbers of backbone carbon atomsPentose4–––––Hexose212813––Heptulose–––––4Nonulose–––31–No. of monosaccharides of the two enantiomeric forms§d19251213l6–34–1No. of monosaccharides of the two ring formsPyranose23281614Furanose2–––––No. of monosaccharides with different nucleotidesADP–––––1CDP7–––––CMP–––31–GDP–18––3TDP/dTDP||9–––––UDP111–13––*Glc-6-P is the precursor for Glc-1-P (conversion catalysed by phosphoglucomutase), Fruf-6-P (catalysed by phosphoglucose isomerase) and Sed-7-P (formed in the non-oxidative phase of the pentose phosphate pathway). Fruf-6-P is the precursor of GDP-Man and UDP-Glc2NAc.†There are two pathways for the biosynthesis of CMP-Leg5Ac7Ac, one starting from UDP-Glc2NAc and the other from Glc2NAc-1-P. Hence, the total number of nucleotide sugars will be 57 even though the row sum is 58.‡l-Rhamnose and Qui4NAc are biosynthesized as both UDP- and TDP-/dTDP-derivatives. Hence, the number of monosaccharides is less than the number of nucleotide sugars by 2.§The prefix d is omitted for d enantiomers whereas the prefix l is explicitly mentioned for l enantiomers.||No distinction is made between TDP and dTDP in this work, since the literature suggests that both ribo- and deoxyribo-substrates are used by enzymes, albeit with varying extents of specificity depending upon the source organism. In fact, dTDP and TDP have been used synonymously by some authors.
Choice of precursors
Glucose-1-phosphate, fructofuranose-6-phosphate and sedoheptulose-7-phosphate are precursors for many of the monosaccharides (Supplementary_data.xlsx: worksheet6). Fructofuranose-6-phosphate and sedoheptulose-7-phosphate are intermediates in the glycolytic pathways, viz. the Embden–Meyerhof pathway and the pentose phosphate pathway, respectively, and these enzymes were not considered for the search. Pathways for biosynthesis of UDP-Glc2NAc and GDP-mannose have been considered separately, since Glc2NAc and mannose are glycan building blocks as well as intermediates in the biosynthesis of several other monosaccharides. Hence, biosynthesis steps of UDP-Glc2NAc and GDP-mannose were excluded from those of their derivatives. An additional pathway for UDP-glucose biosynthesis was considered to analyse its ubiquity, since UDP-glucose is part of both anabolic and catabolic pathways. The biosynthesis of CMP-Leg5Ac7Ac starting from N-acetyl-glucosamine-1-phosphate has also been considered because of the uncommon guanylyltransferase in the first step of the pathway.
Generation of HMM profiles
An HMM profile was generated for each step of a biosynthesis pathway except where mentioned otherwise. Profiles were generated in two steps (Flowchart S1). The extended dataset was created to account for sequence divergence. In some cases, no additional sequences satisfying the aforementioned criteria were found; hence, there is no extended dataset. Each profile was given an annotation based on the enzyme activities of proteins that were used to generate the profile and an identifier of the format GPExxxxx; here GPE stands for Glycosylation Pathway Enzyme and xxxxx is a unique 5-digit number (Supplementary_data.xlsx: worksheet1).
Setting thresholds for HMM profiles
Thresholds for HMM profiles were set as described below (profile-wise details are given in Supplementary_data.xlsx: worksheet1).
Using (Receiver Operator Characteristic) ROC curves
The TrEMBL database was used to generate ROC curves. Several of the TrEMBL entries have been assigned molecular functions electronically based on UniRule and SAAS (Table S1). It is assumed that these annotations are correct while generating ROC curves. True positives, false positives and false negatives were identified by comparing TrEMBL annotations with profile annotations.
Using bit-score scatter plots
Members of some enzyme families differ in their molecular function, while retaining significant global sequence similarity, e.g. C4- and C3-aminotransferases. Consequently, annotations of several TrEMBL sequences belonging to such families are incomplete, e.g. DegT/DnrJ/EryC1/StrS aminotransferase family protein. In such cases, bit score scatter-plots were used to set thresholds (Fig. S3). Scatter plots were also used to set threshold in case of hydrolysing and non-hydrolysing NDP-Hex2NAc C2 epimerases, since many TrEMBL hits are just annotated as NDP-Hex2NAc C2 epimerases.
Using and as thresholds
T or T
was used as the threshold for some profiles for one of two reasons. (i) The sequences used to generate the profile were a subset of the sequences used to generate another profile; the latter set of enzymes has broader substrate specificity than those of the former set. For instance, sequences used for generating GPE02430 (TDP-/dTDP-4-keto-6-deoxyglucose 3-/3,5-epimerase) and GPE02530 (NDP-sugar 3-/3,5-/5-epimerase) are homologues, but the former set has narrow specificity. T was set as threshold for GPE02430, as lowering the threshold would make this profile less specific. (ii) For some profiles, such as GPE50010 (nucleotide sugar formyltransferase), very few TrEMBL entries that score < T had been assigned molecular function; hence, a ROC curve could not be generated.
The case of GPE00530
Scanning the TrEMBL database with GPE00530 (glucose-1-phosphate uridylyltransferase family 2) using the default threshold of hmmer (E value=10) resulted in 2693 hits with matching annotation and their scores ranged continuously from 705 to 303 bits and then from 57 to 41 bits. It was not possible to generate a ROC curve because of this discontinuity. Hence, 303 bits was set as the threshold.
Profile annotations with broader substrate/product specificities
Many sequence homologues catalyse the ‘same’ reaction, but with (slightly) different substrate specificities. Sequence changes that confer such differential specificities are subtle and often unknown. HMM profiles of such families lack the ability to discriminate between sequences with varying substrate specificities. Two products, a major product and a minor product, are formed in certain enzyme catalysed reactions [10-12]. It is possible that only the major product has been characterized while assaying an enzyme with broader substrate specificity. Another possibility is that only a subset of possible substrates has been assayed for. Hence, substrate specificities are broad in the annotations of some of the profiles. As opposed to these, some profiles of aminotransferases and reductases are generated from enzymes that differ from each other with respect to the product formed, viz. orientation (equatorial or axial) of the newly formed/added -OH/-NH2 group. The profile for 3,4-ketoisomerase is also of this type. UDP-GlcA decarboxylase (UXS) converts UDP-GlcA to UDP-4-keto xylose, which is further reduced to UDP-xylose. UDP-4-keto xylose is a minor product for human UXS, whereas it is a major product for UXS [11]. Both these enzymes were used to generate the profile GPE20030 (Supplementary_data.xlsx: worksheet1).
Pathway steps associated with more than one HMM profile
Some steps are associated with more than one profile for one of two reasons. (i) Non-orthologous enzymes are known to catalyse the same reaction, e.g. phosphomannoisomerases. (ii) Two or more profiles are generated, one with narrow and the other(s) with broad substrate specificity. Enzymes used for the former are a subset of enzymes used for the latter type of profiles, e.g. aminotransferases. The process flow adopted to assign annotation for a sequence that satisfies thresholds for more than one profile is shown in Flowchart S2.
Finding homologues using blastp instead of HMM profiles
HMM profiles were generated only when four or more experimentally characterized enzymes were available (two exceptions are discussed below). Global alignment and sequence similarity were used as the criteria to infer homology based on blastp searches. The default values were set to be ≥90 % query coverage and ≥30 % sequence similarity. However, these values were upwardly revised when query sequences belonged to homologous families that were functionally divergent (Supplementary_data.xlsx: worksheet2). Specifically, similarity and coverage cut-offs were revised by performing an all-against-all blastp search of all experimentally characterized sequences of monosaccharide biosynthesis pathways.PdeG (Q81A42_1–328) is a retaining UDP-Glc2NAc 4,6-dehydratase [13]. It shares higher sequence similarity with inverting UDP-Glc2NAc 4,6-dehydratases than with retaining dehydratases. The sequence of PdeG was compared with TrEMBL hits for the HMM profile of inverting UDP-Glc2NAc 4,6-dehydratases (GPE05331), based on which the sequence similarity cut-off for PdeG was set to 70 %. The threshold for GPE05331 was set such as to exclude PdeG (Fig. S3).
Criteria for finding homologues of UDP-2,4-diacetamido-2,4,6-trideoxy-β-l-altrose hydrolase and UDP-4-amino-6-deoxy-Glc2NAc acetyltransferase
Four experimentally characterized enzymes are known for each of these two families. However, the blastp approach was used instead of generating an HMM profile. This was because a suitable bit score threshold could not be assigned, which, in turn, was because several of the TrEMBL entries obtained as hits were annotated as CMP-N-acetylneuraminic acid synthetase or equivalent (for hydrolase), or O-acetyltransferase or equivalent (for acetyltransferase).
Uncertainties in prediction
Any description of the molecular function of a protein is stratified and includes specifying the type of reaction catalysed, substrate(s) used, etc. A vast majority of sequences conceptually translated from genome sequences are assigned molecular function based on sequence homology to experimentally characterized proteins. Even though experimental validation is available for only a small fraction of proteins due to practical constraints, such studies have shown that homology-based assignments are generally valid, and deviations typically pertain to the extent of substrate specificity, metal ion dependency and such. Nevertheless, caution is warranted with increasing sequence divergence and one has to be on the lookout for homologues that have acquired new molecular functions as a result of mutation of a handful of key residues (neo-functionalization). In view of this, in the present study, HMM and blastp thresholds were chosen with higher stringency and assignment of substrate(s) and product(s) was made conservatively by manually curating false positives and false negatives from the Swiss-Prot database, details of which are given below.Both GDP-rhamnose and GDP-6-deoxytalose were assigned as products of the same pathway, because their biosynthesis proceeds through the same pathway with the exception of the last step being catalysed by homologous C4-reductases. It is not possible to infer whether product specificity of enzymes in this family is absolute or partial, i.e. one is a major product and the other a minor product, due to inadequate experimental data. An identical situation is seen in the pathways for the biosynthesis of CDP-cillose and CDP-cereose, and for CDP-abequose and CDP-paratose. In view of this, prevalence data will be the same for the two monosaccharides of a pair (Supplementary_data.xlsx: worksheet3).Non-hydrolysing NDP-Hex2NAc C2-epimerases (GPE02030) are part of biosynthesis pathways of different monosaccharides. The extent of substrate specificity of the experimentally characterized members of this family is not known, since not all enzymes have been assayed using all possible substrates. In the literature, substrate specificity has been assigned based on the genomic context, and the same approach has been followed in the present study as well. For example, hits for the GPE02030 profile are treated as Man2NAc synthesis pathway enzymes, unless other enzymes of l-Fuc2NAc, l-Qui2NAc or Man2NAc3NAcA pathway are also present.Some monosaccharides are precursors for other monosaccharides; hence, genomes predicted to have the pathway for the latter monosaccharide will also have the precursor monosaccharide. The following are the precursor–final product monosaccharide pairs encountered in this study: (i) l-Rha2NAc → l-Qui2NAc, (ii) l-rhamnose → 6-deoxy-l-talose, (iii) fucose → fucofuranose, (iv) paratose → tyvelose, (v) galactose → galactofuranose, (vi) GlcA → GalA, (vii) l-Ara4N → l-Ara4NFo, (viii) Per → Per4Ac, (ix) Man2NAc → Man2NAcA, (x) Glc2NAcA → Gal2NAcA and (xi) Bac2Ac4Ac → Leg5Ac7Ac.The pathway for the synthesis of l-arabinose is an extension of the pathway for the synthesis of xylose. However, most genomes predicted to have the xylose pathway also have the l-arabinose pathway. This is because UDP-sugar C4-epimerase family members (GPE02230) catalyse C4-epimerization of glucose, GlcA, Glc2NAc, Glc2NacA and xylose. Assigning substrate specificity solely based on sequence similarity is not possible. The challenge is compounded by the fact that some of these enzymes show broad substrate specificity, while the rest are only specific to a single substrate. Not all enzymes have been assayed for all potential substrates.
Results
Glycan alphabet size is not the same across Archaea and across Bacteria
The number of monosaccharides used by different species is significantly different (Fig. 1) and is independent of proteome size (Fig. S4). Data for the prevalence of monosaccharides in 12 939 genomes is very similar to that in 3384 species (Fig. S5), indicating that the outcome is not biased by the skew in the number of genomes (strains) sequenced for a given species (Fig. S1). In fact, none of the organisms use all 55 monosaccharides: the highest number of monosaccharides used by an organism is 23 ( 14EC033). Just 1 and 2 monosaccharides are used by 188 and 117 species, respectively. Glucose, galactose and mannose, and their 2-N-acetyl (Glc2NAc, Gal2NAc, Man2NAc) and uronic acid (GlcA, GalA, Glc2NAcA, Gal2NAcA) derivatives are the most prevalent besides l-rhamnose, as the biosynthesis pathways for these monosaccharides are found in >50 % of genomes (Fig. 2). These monosaccharides are, thus, categorized as the ‘common’ group. However, none of them are used by all organisms (Supplementary_data.xlsx: worksheet3).
Fig. 1.
The number of monosaccharides for which biosynthesis pathways are found in a species. More than one strain has been sequenced for several species (Fig. S1). In such cases, data for the strain that has the highest number of monosaccharides has been plotted. Total number of species=3384.
Fig. 2.
Classification of monosaccharides into three groups based on their prevalence in archaeal+bacterial genomes. These groups are: common (found in ≥50 % of genomes), less common and rare (found in ≤10 % of genomes). Abbreviated names are used for some of the monosaccharides; the full names of these are given in Supplementary_data.xlsx: worksheet4.
The number of monosaccharides for which biosynthesis pathways are found in a species. More than one strain has been sequenced for several species (Fig. S1). In such cases, data for the strain that has the highest number of monosaccharides has been plotted. Total number of species=3384.Classification of monosaccharides into three groups based on their prevalence in archaeal+bacterial genomes. These groups are: common (found in ≥50 % of genomes), less common and rare (found in ≤10 % of genomes). Abbreviated names are used for some of the monosaccharides; the full names of these are given in Supplementary_data.xlsx: worksheet4.
Evolution and diversification of the glycan alphabet
It was observed that only a limited set of enzymes was sufficient to biosynthesize the common group monosaccharides, e.g. nucleotidyltransferases (activation), amidotransferase and N-acetyltransferase (Hex2NAc from a hexose), C4-epimerase (Glc to Gal) and C6-dehydrogenase (uronic acid) belonging to the short-chain dehydrogenase reductase (SDR) superfamily, non-hydrolysing C2-epimerase (Glc2NAc to Man2NAc), mutase (6-P to 1-P) and isomerase (pyranose to furanose) (Fig. 3, Supplementary_data.xlsx: worksheet5). Using this limited set of monosaccharides, organisms seem to achieve structural diversity by mechanisms such as alternative isomeric linkages, branching and repeat length heterogeneity. Some organisms use an additional set of monosaccharides, viz. l-fucose, galactofuranose, xylose, l-Ara4N and l-arabinose. These monosaccharides are categorized as the less common group. Organisms using this group of monosaccharides have enhanced the glycan repertoire by acquiring C3/C5-epimerase, 4,6-dehydratase, C4-reductase, C6-decarboxylase and C4-aminotransferase. The rest of the monosaccharides are used by very few organisms; thus, they constitute the rare group (Fig. 2).
Fig. 3.
A qualitative comparison of the number of monosaccharides of the three groups, viz. common (C), less common (LC) and rare (R), with their prevalence in archaeal+bacterial genomes and the number of types of enzymes required for their biosynthesis. The size of a group is inversely related to the prevalence of the corresponding group of monosaccharides. Enzymes required for the biosynthesis of common group monosaccharides are required for the biosynthesis of the less common and rare groups also; similarly, those for the less common group are required for the biosynthesis of the rare group also. Different enzymes belonging to each of the superfamilies mentioned above are listed in the file Supplementary_data.xlsx: worksheet5. Note that the group sizes are not to scale. It should be noted that additional types of enzymes may have to be included when experimental data about the pathways for the biosynthesis of other monosaccharides becomes available. HAD, Haloalkanoic acid dehalogenase; Gfo/Idh/MocA, glucose‐fructose oxidoreductase/inositol 2‐dehydrogenase/rhizopine catabolism protein MocA; GNAT, GCN5-related N-acetyltransferase; LβH, left-handed β helix; PEP, phosphoenolpyruvate; PLP, pyridoxal 5′-phosphate; SAM, S-adenosyl-l-methionine.
A qualitative comparison of the number of monosaccharides of the three groups, viz. common (C), less common (LC) and rare (R), with their prevalence in archaeal+bacterial genomes and the number of types of enzymes required for their biosynthesis. The size of a group is inversely related to the prevalence of the corresponding group of monosaccharides. Enzymes required for the biosynthesis of common group monosaccharides are required for the biosynthesis of the less common and rare groups also; similarly, those for the less common group are required for the biosynthesis of the rare group also. Different enzymes belonging to each of the superfamilies mentioned above are listed in the file Supplementary_data.xlsx: worksheet5. Note that the group sizes are not to scale. It should be noted that additional types of enzymes may have to be included when experimental data about the pathways for the biosynthesis of other monosaccharides becomes available. HAD, Haloalkanoic acid dehalogenase; Gfo/Idh/MocA, glucose‐fructose oxidoreductase/inositol 2‐dehydrogenase/rhizopine catabolism protein MocA; GNAT, GCN5-related N-acetyltransferase; LβH, left-handed β helix; PEP, phosphoenolpyruvate; PLP, pyridoxal 5′-phosphate; SAM, S-adenosyl-l-methionine.Occurrence of the common group of monosaccharides in all three domains of life points to their presence early on during evolution. Neo- and sub-functionalization of horizontally acquired and duplicated genes during the course of evolution have been widely reported [14, 15]. It is envisaged that the enzymes required for the biosynthesis of rare group monosaccharides have arisen by such neo- and sub-functionalization. Aminotransferase and SDR superfamily enzymes involved in the biosynthesis of monosaccharides lend support to this inference. Superimposition of a few C3- and C4-aminotransferases shows remarkable conservation of the 3D structures despite differences in the pyranose ring position at which the amino group is transferred as well as the nucleotide sugar substrate (Fig. 4). 3D structures are conserved even among SDR superfamily enzymes despite catalysing different reactions, viz. epimerization (at C2 or C4), removal of water (dehydratase at C4, C6) and reduction (at C4).
Fig. 4.
3D structural superimposition of enzymes belonging to aminotransferase (a) and SDR (b) superfamilies involved in the biosynthesis of monosaccharides. Colour scheme: helices, raspberry red; sheet, forest green; loops, light blue. (a) Aminotransferase superfamily enzymes: 1MDO_A, ArnB from UDP-l-Ara4N biosynthesis; 2FNI_A, PseC from CMP-l-Pse45Ac7Ac biosynthesis; 2OGA_A, DesV from TDP-/dTDP-desosamine biosynthesis; 3BN1_A, PerA from GDP-per biosynthesis; 3NYU_A, WbpE from UDP-Man2NAc3NAcA biosynthesis; 4PIW_A, WecE from TDP-/dTDP-Fuc4NAc biosynthesis; 4ZTC_A, PglE from CMP-Leg5Ac7Ac biosynthesis; 5U1Z_A, WlaRG from TDP-/dTDP-Fuc3NAc/Qui3NAc biosynthesis. ArnB, PseC, PerA, WecE and PglE are C4-aminotransferases, whereas DesV, WbpE and WlaRG are C3-aminotransferases. (b) 1ORR_A, RfbE, C2-epimerase from CDP-tyvelose biosynthesis; 2PK3_A, Rmd, C4-reductase from GDP-rhamnose biosynthesis; 1KBZ_A, RmlD, C4-reductase from TDP-/dTDP-l-rhamnose biosynthesis; 1T2A_A, Gmd, C4,C6-dehydratase from GDP-l-fucose biosynthesis; 1SB8_A, WbpP, C4-epimerase from UDP-Gal2NAc biosynthesis; 5BJU_A, PglF, C4,C6-dehydratase from UDP-Bac2Ac4Ac biosynthesis.
3D structural superimposition of enzymes belonging to aminotransferase (a) and SDR (b) superfamilies involved in the biosynthesis of monosaccharides. Colour scheme: helices, raspberry red; sheet, forest green; loops, light blue. (a) Aminotransferase superfamily enzymes: 1MDO_A, ArnB from UDP-l-Ara4N biosynthesis; 2FNI_A, PseC from CMP-l-Pse45Ac7Ac biosynthesis; 2OGA_A, DesV from TDP-/dTDP-desosamine biosynthesis; 3BN1_A, PerA from GDP-per biosynthesis; 3NYU_A, WbpE from UDP-Man2NAc3NAcA biosynthesis; 4PIW_A, WecE from TDP-/dTDP-Fuc4NAc biosynthesis; 4ZTC_A, PglE from CMP-Leg5Ac7Ac biosynthesis; 5U1Z_A, WlaRG from TDP-/dTDP-Fuc3NAc/Qui3NAc biosynthesis. ArnB, PseC, PerA, WecE and PglE are C4-aminotransferases, whereas DesV, WbpE and WlaRG are C3-aminotransferases. (b) 1ORR_A, RfbE, C2-epimerase from CDP-tyvelose biosynthesis; 2PK3_A, Rmd, C4-reductase from GDP-rhamnose biosynthesis; 1KBZ_A, RmlD, C4-reductase from TDP-/dTDP-l-rhamnose biosynthesis; 1T2A_A, Gmd, C4,C6-dehydratase from GDP-l-fucose biosynthesis; 1SB8_A, WbpP, C4-epimerase from UDP-Gal2NAc biosynthesis; 5BJU_A, PglF, C4,C6-dehydratase from UDP-Bac2Ac4Ac biosynthesis.
Glycan alphabet varies even across strains
Remarkably, variations in the size of the glycan alphabet are significant even at the strain level (Fig. 5). Strain-specific differences are pronounced in species such as , and (Fig. 6), possibly reflecting the diverse environments that these organisms inhabit. Among organisms that inhabit the same environment, strain-specific differences show a mixed pattern: among the 71 strains of , the maximum and minimum number of monosaccharides utilized by a strain are 4 and 12, respectively. Such a variation could have evolved as a mechanism to evade the host immune response. In contrast, strains of and strains of inhabit the same environment (respiratory tract and skin, respectively), and show very little variation in the monosaccharides they use. Both are capsule-producing opportunistic pathogens, suggesting that they might bring about antigenic variation by variations in linkage types, branching, etc. [16], even with the same set of monosaccharides. Strains of , , or , all of which are human intracellular pathogens, also show insignificant variation. It is possible that different strains of a pathogen are a part of distinct microbiomes and microbial interactions within the biome/with the host determine the glycan alphabet of the organism. Availability of additional characteristics, such as phenotypic data and temporal variations in glycan structures, is critical for understanding the presence/absence of strain-specific variations.
Fig. 5.
Variations in the number of monosaccharides used by different strains of a species. Species with more than one sequenced strain and at least one monosaccharide predicted in one of the strains are considered. Only the smallest and largest numbers are shown.
Fig. 6.
Different strains of some of the species do not use the same number of monosaccharides. The ranges of the number of monosaccharides used by various strains of some of the clinically important species are shown here. The number of sequenced strains for each organism is shown above the corresponding bar. The number in parenthesis after the name of each organism represents the minimum number of monosaccharides used by one of the strains of this organism. Note that the set of monosaccharides encoded by different strains utilizing the same number of monosaccharides may vary. Organisms associated with a narrow habitat are shown in blue, while those with broad habitat are shown in purple.
Variations in the number of monosaccharides used by different strains of a species. Species with more than one sequenced strain and at least one monosaccharide predicted in one of the strains are considered. Only the smallest and largest numbers are shown.Different strains of some of the species do not use the same number of monosaccharides. The ranges of the number of monosaccharides used by various strains of some of the clinically important species are shown here. The number of sequenced strains for each organism is shown above the corresponding bar. The number in parenthesis after the name of each organism represents the minimum number of monosaccharides used by one of the strains of this organism. Note that the set of monosaccharides encoded by different strains utilizing the same number of monosaccharides may vary. Organisms associated with a narrow habitat are shown in blue, while those with broad habitat are shown in purple.
Prevalence of monosaccharides across phyla
Not all sugars of the common group (Fig. 1) are found across all phyla, whereas Neu5Ac belonging to the rare group is found across all phyla. GlcA and GalA (common group) are absent in , suggesting that pathways for their biosynthesis are lost in this phylum. A similar conclusion is drawn for the absence of l-fucose and l-colitose in the TACK (Thaumarchaeota, Aigarchaeota, Crenarchaeota, and Korarchaeota) group. Most of the rare group sugars are limited to a very few species in a few phyla (Fig. 7). For instance, Fuc4NAc and l-glycero-β-d-manno-heptose (ADP-linked) are found only in , a class that comprises several pathogens. The other three heptoses, which are GDP-linked, are absent in . Recently, it was found that , belonging to the class , synthesizes ADP-glycero-β-d-manno-heptose for activating the NF-κβ pathway in human epithelial cells [17]. This pathway has been experimentally characterized in very few organisms. Consequently, homologues for this pathway were found by blastp queries and not by HMM profiles. In the present study, this pathway turned out to be a false negative because of the high stringency set for blastp thresholds. In view of this, it is possible that such sugars which appear restricted to a few phyla are also found in others.
Fig. 7.
Prevalence of less common and rare group monosaccharides in different microbial phyla. Data for phyla with less than five sequenced genomes are not shown to avoid visual clutter. Only names of monosaccharides are used for annotation even though all are biosynthesized as nucleotide sugars. Abbreviated names are used for some of the monosaccharides; the full names of these are given in Supplementary_data.xlsx: worksheet4.
Prevalence of less common and rare group monosaccharides in different microbial phyla. Data for phyla with less than five sequenced genomes are not shown to avoid visual clutter. Only names of monosaccharides are used for annotation even though all are biosynthesized as nucleotide sugars. Abbreviated names are used for some of the monosaccharides; the full names of these are given in Supplementary_data.xlsx: worksheet4.
Why do some Eubacteria not biosynthesize any monosaccharide?
None of the monosaccharides are biosynthesized by some mollicutes (e.g. ) and endosymbionts (e.g. sp. and sp.), because the biosynthesis pathways are completely absent. Mollicutes lack cell walls [18], which could explain the absence of monosaccharides. Endosymbionts have reduced genomes, which is seen as an adaptation to host dependence [19, 20]. Biosynthesis pathway enzymes are lost/are being lost as part of the phenomenon of genome reduction. This is illustrated by the endosymbiont : 13 of the 25 strains have the pathway for the biosynthesis of UDP-Glc2NAc, 7 have a partial pathway and 5 do not encode any genes of this pathway. Pathways for none of the other monosaccharides are found in this organism. Pathways are incomplete, i.e. enzymes catalysing one or more steps of the pathway are absent in some organisms. Some species of , and lack mannose-1-phosphate guanylyltransferase, because of which GDP-mannose is not biosynthesized. GlmU, which converts Glc2N-1-phosphate to UDP-Glc2NAc, is absent in sp. However, Glc2N is found in the lipopolysaccharide of [21]. Whether this is indicative of the presence of a transferase that uses Glc2N-1-phosphate instead of UDP-Glc2N needs to be explored.
Do spp. and spp. source monosaccharides from their host?
spp. (60 strains), (7 strains) and spp. (143 strains) are obligate intracellular bacteria. does not contain pathways for the biosynthesis of any of the monosaccharides. This is in consonance with the finding that it does not contain extracellular polysaccharides [20]. species have pathways for the biosynthesis of Man2NAc, l-Qui2NAc and l-Rha2NAc. l-Rha2NAc is the immediate precursor for l-Qui2NAc (Fig. S2e). are known to use Man2NAc and l-Qui2NAc but not l-Rha2NAc [22], indicative that UDP-l-Rha2NAc is just an intermediate in these organisms. The pathway for the biosynthesis of UDP-Glc2NAc, precursor for these Hex2NAcs, is absent, suggesting partial dependence on the host (human). Notably, genes for the biosynthesis of Man2NAc and l-Qui2NAc have not been reported so far in humans, which explains why have retained these pathways (the human genome was scanned and these pathways were not found; unpublished data). Both and belong to the same order, . Symptoms caused by these two are similar [23]. In spite of similarities in host preference and pathogenicity, spp. continue to use certain monosaccharides while diverging from [24], which uses none. Is this because use ticks as vectors, whereas use mites [25]? , the only rickettsial species that uses mites as vectors and contains pathways for Man2NAc and l-Qui2NAc biosynthesis, has been proposed to be placed as a separate group, because its genotypic and phenotypic characteristics are intermediate to those of and [25].
Absence of Glc2NAc in organisms other than endosymbionts
UDP-Glc2NAc is the precursor for the biosynthesis of several monosaccharides (Fig. S2e, f). However, pathways for its biosynthesis are absent in ~10 % of the genomes excluding endosymbionts. None of the organisms in the FCB (Fibrobacteres, Chlorobi, and Bacteroidetes) group and contain this monosaccharide. Further analysis revealed the loss of the first (GlmS) or last (GlmU) enzyme of the pathway in several of their genomes. This pattern suggests that organisms of thse phyla are in the process of losing the UDP-Glc2NAc pathway. Incidentally, some of these genomes do contain its derivatives. They include host-associated organisms such as Bacteriodes fragilis, Flavobacterium sp., , , , , etc., suggesting that they obtain Glc2NAc from their microenvironment. However, a few free-living organisms that contain derivatives of UDP-Glc2NAc but not UDP-Glc2NAc were also identified. For instance, GlmU is not present in (arctic surface seawater) and its C-terminus (acetyltransferase domain) is absent in (freshwater). Nonetheless, both organisms contain the UDP-l-Qui2NAc pathway cluster.
Prevalence of enantiomeric pairs and isomers of N-acetyl derivatives
Both enantiomers of a few monosaccharides are reported in natural glycans. The two enantiomers may or may not be biosynthesized from the same precursor, and may be linked to different nucleotides (Table S2). The present analysis shows that both enantiomers are found in only a small number of organisms, in specific genera, class or phyla (Table 3). Three isomeric N-acetyl derivatives of fucosamine (6-deoxygalactosamine) and of quinovosamine (6-deoxyglucosamine) are found in living systems. The N-acetyl group is present at C2, C3 or C4 position in these isomers. Only a few organisms use more than one of these three isomers (Table 3). One such organism is NCTC11151, which contains both Fuc4NAc and Fuc3NAc. In contrast, O177:H21 uses l-Fuc2NAc along with Fuc3NAc. Genomic context analysis showed that Fuc4NAc biosynthesis genes are part of the O-antigen cluster in both these strains. However, genes for the biosynthesis of Fuc3NAc (in NCTC11151) and l-Fuc2NAc (in O177:H21) are present as part of the colanic acid cluster. Four genomes (strains) of use Qui4NAc, Qui2NAc and l-Qui2NAc; genes required for the biosynthesis of these three monosaccharides are all in the same genomic neighbourhood.
Table 3.
Presence of enantiomeric pairs and isomeric N-acetyl derivative pairs
The diastereomeric pair of Fuc4NAc and l-Fuc2NAc are found in some strains of .
Monosaccharide*
Where present
No. of organisms (genomes) in which these monosaccharide pairs are used
Both d- and l-enantiomers
Galactose
Extremophiles
33
Fucose
Phyla FCB group
Phylum Deferribacteres
Class Gammaproteobacteria
7
1
10
Rhamnose
Genus Pseudomonas
338
6-Deoxytalose
Genus Pseudomonas
140
Qui2NAc
Phylum Proteobacteria
64
Fuc2NAc†
Genus Staphylococcus
300
Isomers of N-acetyl derivatives
Fuc3NAc and Fuc4NAc‡
Family Enterobacteriaceae
33
Qui2NAc, Qui4NAc
Several phyla
193
*Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx: worksheet4.
†Both Fuc2NAc and l-Fuc2NAc are components of capsular polysaccharides [40].
‡Glucose-1-phospate is the precursor for both Fuc3NAc and Fuc4NAc, and UDP-Glc2NAc is the precursor for Fuc2NAc (the isomer that is absent in these organisms).
Presence of enantiomeric pairs and isomeric N-acetyl derivative pairsThe diastereomeric pair of Fuc4NAc and l-Fuc2NAc are found in some strains of .Monosaccharide*Where presentNo. of organisms (genomes) in which these monosaccharide pairs are usedBothGalactoseExtremophiles33FucosePhyla FCB groupPhylumClass7110RhamnoseGenus3386-DeoxytaloseGenus140Qui2NAcPhylum64Fuc2NAc†Genus300Isomers ofFuc3NAc and Fuc4NAc‡Family33Qui2NAc, Qui4NAcSeveral phyla193*Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx: worksheet4.†Both Fuc2NAc and l-Fuc2NAc are components of capsular polysaccharides [40].‡Glucose-1-phospate is the precursor for both Fuc3NAc and Fuc4NAc, and UDP-Glc2NAc is the precursor for Fuc2NAc (the isomer that is absent in these organisms).
Why are some pathways not found in Archaea?
Most of the rare group monosaccharides are absent in Archaea. Members of contain a higher number of monosaccharides than the TACK group. This could be suggestive of lateral gene transfer events with bacterial members, as members of , particularly methanogens, coexist with other organisms in microbiomes [26] and have been inferred to acquire their genetic content [27]. It is premature to associate the absence of monosaccharide diversity to the apparent lack of pathogenicity in Archaea [26]. This is because of inadequate information regarding the abundance of Archaea in various microbiomes. This, in turn, is due to our limitations in the detection of Archaea and associating them with disease phenotypes.Apart from these possibilities, methodological limitations may have resulted in the apparent absence of monosaccharides in Archaea. Only 4–5 % of the 789 sequences used for generating HMM profiles or as blastp queries are from Archaea. The pathway for the biosynthesis of TDP-/dTDP-l-rhamnose has four enzymes, viz. RmlA, RmlB, RmlC and RmlD. Of these, only RmlB could not be found by HMM profile in sp., sp. and sp., leading to the conclusion that l-rhamnose is absent in these organisms. Analysis of the neighbourhood of RmlA, RmlC and RmlD revealed a sequence that could potentially be RmlB, since it retains conserved residues of this family. This sequence could not be captured by the profile-based search due to stringent thresholds (=400 bits) (profile GPE05430; Supplementary_data.xlsx: worksheet1). Potential RmlB sequences of these organisms score 300–350 bits. This observation suggests that the pathway exists in these organisms, but was not identified due to the stringency of the threshold. However, this is in contrast to other cases of absence of monosaccharides, wherein none of the proteins of a pathway in the genome score even the default bit score of hmmer (i.e. 10 bits).
Use of more than one nucleotide derivative/alternative pathways
l-Rhamnose and Qui4NAc are biosynthesized as both UDP- and TDP-/dTDP-derivatives (Fig. S2a, c). However, the TDP-/dTDP-pathways are found in Archaea and Bacteria, but not the UDP-pathways. TDP-/dTDP-6-deoxy-l-talose is biosynthesized via reduction of TDP-/dTDP-4-keto-l-rhamnose or C4 epimerization of TDP-/dTDP-l-rhamnose (Fig. S2a). The former pathway occurs in 141 genomes belonging to multiple phyla, and notably in sp., sp. and sp. The latter pathway is found in 255 genomes belonging to and Terrabacteria, and notably in sp., sp. and . N,N′-Diacetyl legionaminic acid can be biosynthesized either from the UDP-route or GDP-route (Fig. S2f). The latter pathway is found in 93 of 96 genomes of , whereas the former is found in 10 other genomes primarily belonging to /.
Discussion
The importance of glycans, especially in Archaea and Bacteria, is well documented. Establishing the specific role of glycans and studying structure–function relationships is largely hindered by factors such as the non-availability of high-throughput sequencing methods, inadequate information as to which genes are involved in non-template driven biosynthesis, phase variation [28] and microheterogeneity [7]. In this study, completely sequenced archaeal and bacterial genomes were searched for monosaccharide biosynthesis pathways using a sequence-homology-based approach. It was found that the usage of monosaccharides is not at all conserved across Archaea and Bacteria. This is in stark contrast to the alphabets of DNA and proteins, which are universal. In addition, marked differences are observed even among different strains of a species. The range of monosaccharides used by an organism seems to be influenced by environmental factors such as growth (nutrients, pH, temperature, etc.) and environmental (host, microbiome, etc.) conditions. For instance, high uronic acid content in exopolysaccharides of marine bacteria imparts an anionic property, which is implicated in uptake of Fe3+; thus, promoting its bioavailability to marine phytoplankton for primary production [29] and against degradation by microbes [30]. Mutation in genes that encode enzymes for the biosynthesis of lipopolysaccharide in was shown to confer resistance to T7 phage [31]. Thus, organisms, even at the level of strains, seem to evolve to modify their monosaccharide repertoire to increase fitness. In fact, selection pressure and horizontal gene transfer events could be the reason for the monosaccharide repertoire of bacteria far exceeding those of mammals and other eukaryotes.Genes encoding enzymes for the biosynthesis of Neu5Ac are found in 5 and 0.6% genomes of and , respectively; the bacterial carbohydrate structure database had no Neu5Ac-containing glycan from organisms belonging to this class/phylum [9]. l-Rhamnose and l-fucose are found in 16 % of and genomes and in 25 % of genomes. However, very few l-rhamnose- and l-fucose-containing glycans from these classes/phyla are deposited in the database, leading to the inference that these are rare sugars in these classes/phyla. This is indicative that monosaccharide usage based on an analysis of experimentally characterized glycans can at best give a partial picture.Rare group monosaccharides are those that are found only in a few species, genera and phyla. Reasons for acquiring rare group sugars can at best be speculative. For instance, Bac2Ac4Ac occurs at the reducing end of glycans N- and O-linked to proteins [32], but the presence of Bac2Ac4Ac is not mandatory for PglB, an oligosaccharyltransferase, since it can transfer glycans that have Glc2NAc, Gal2NAc or Fuc2NAc also at the reducing end [33]. Perhaps, Bac2Ac4Ac provides resistance to enzymes like PNGase F that cleave off N-glycans. l-Rhamnose, Neu5Ac, l-Qui2NAc, Man2NAc and l-Ara4N are not used by (a non-pathogen), but are used by (a pathogen). It is tempting to infer that these monosaccharides impart virulence to the latter, but analysis of monosaccharides used by strains belonging to multiple pathotypes (enterohaemorrhagic, enteropathogenic, uropathogenic) did not reveal any relationship between monosaccharides and their phenotype. Tyvelose, paratose and abequose are 3,6-dideoxy sugars that belong to the rare group. These are found primarily in , and . These are present in the O-antigen of [34]. , closely related to and derived from , lacks O-antigen (rough phenotype) due to the silencing of the O-antigen cluster [35]. , also an enteric pathogen like , does not contain these monosaccharides. Hence, the role of these 3,6-dideoxy sugars in the O-antigen of does not seem to be related to enteropathogenicity.Besides answering the question of the universality of the glycan alphabet, this study also has led to certain beneficial outcomes. l-Rhamnose, mannose and l-Pse5ac7Ac are found in Bacillus cereus, Bacillus mycoides and , but not in Bacillus subtilis, Bacillus amyloliquefaciens, Bacillus licheniformis, Bacillus velezensis and Bacillus vallismortis. Such differences potentially may be exploited towards taxonomic identification, provided that these patterns hold true after analysis of a larger number of strains from each of these species. Enzymes synthesizing monosaccharides that are exclusive to a pathogen vis-à-vis its host can be identified as potential drug targets. An illustrative example is the non-hydrolysing C2 epimerase: it mediates the synthesis of UDP-Man2NAc, UDP-l-Qui2NAc, UDP-l-Fuc2NAc and UDP-Man2NAc3NAc, and is found in 60 % of the archaeal+bacterial genomes, but not in humans (the human genome was scanned for the presence of these pathways; unpublished results). It has already been reported that inhibitors of this enzyme are effective against meticillin-resistant and a few other bacteria [36]. Based on the prevalence of this enzyme in all other phyla, inhibitors against this enzyme would be promising broad-spectrum antimicrobial therapies. As already noted [37], knowledge of monosaccharide composition is also useful for ensuring consistency of recombinant glycoprotein therapeutics. Knowledge of biosynthesis pathways also allows cloning the entire cassette in a heterologous host for large-scale production of monosaccharides for commercial and research applications.Thus, glycans show the least evolutionary conservation among the three macromolecules (carbohydrates, nucleic acids and proteins) [38]. Owing to their virtue of endowing distinction, existence of a universal glycan alphabet is antithetical. Here, alphabet is used in the same sense as its dictionary meaning, viz. a set of letters or symbols that combine to form complex entities. In the case of glycans, structural diversity arises not only by the set of monosaccharides an organism uses, but also by linkage variations (α1→3, β1→4, etc.), branching and modifications (e.g. sulfation, acetylation, etc.). Knowledge of the linkage types, branching patterns and modifications that an organism uses will further our understanding of the biological roles of glycans.Click here for additional data file.Click here for additional data file.
Authors: Michael Wacker; Mario F Feldman; Nico Callewaert; Michael Kowarik; Bradley R Clarke; Nicola L Pohl; Marcela Hernandez; Enrique D Vines; Miguel A Valvano; Chris Whitfield; Markus Aebi Journal: Proc Natl Acad Sci U S A Date: 2006-04-25 Impact factor: 11.205
Authors: Stephan Herget; Philip V Toukach; René Ranzinger; William E Hull; Yuriy A Knirel; Claus-Wilhelm von der Lieth Journal: BMC Struct Biol Date: 2008-08-11