Benjamin P Kellman1, Yujie Zhang1, Emma Logomasini1, Eric Meinhardt1, Karla P Godinez-Macias1, Austin W T Chiang1, James T Sorrentino1, Chenguang Liang1, Bokan Bao1, Yusen Zhou1, Sachiko Akase1, Isami Sogabe1, Thukaa Kouka1, Elizabeth A Winzeler1, Iain B H Wilson1, Matthew P Campbell1, Sriram Neelamegham1, Frederick J Krambeck1, Kiyoko F Aoki-Kinoshita1, Nathan E Lewis1.
Abstract
Systems glycobiology aims to provide models and analysis tools that account for the biosynthesis, regulation, and interactions with glycoconjugates. To facilitate these methods, there is a need for a clear glycan representation accessible to both computers and humans. Linear Code, a linearized and readily parsable glycan structure representation, is such a language. For this reason, Linear Code was adapted to represent reaction rules, but the syntax has drifted from its original description to accommodate new and originally unforeseen challenges. Here, we delineate the consensuses and inconsistencies that have arisen through this adaptation. We recommend options for a consensus-based extension of Linear Code that can be used for reaction rule specification going forward. Through this extension and specification of Linear Code to reaction rules, we aim to minimize inconsistent symbology thereby making glycan database queries easier. With a clear guide for generating reaction rule descriptions, glycan synthesis models will be more interoperable and reproducible thereby moving glycoinformatics closer to compliance with FAIR standards. Here, we present Linear Code for Reaction Rules (LiCoRR), version 1.0, an unambiguous representation for describing glycosylation reactions in both literature and code.
Keywords: glycoinformatics; linear code; systems glycobiology
Year: 2020 PMID: 33178355 PMCID: PMC7607430 DOI: 10.3762/bjoc.16.215
Source DB: PubMed Journal: Beilstein J Org Chem ISSN: 1860-5397 Impact factor: 2.883
The reaction rule Ab3GNb → Ab3(Fa4)GNb represented in Symbol Nomenclature for Glycans [18], Linear Code, IUPAC, GlycoCT, and WURCS. Linear Code provides the most straightforward and succinct representation.
| Reactant | Product | |
| Structure plot | ||
| Linear Code | Ab3GNb | Ab3(Fa4)GNb |
| IUPAC-extended | β-ᴅ-Gal | β-ᴅ-Gal |
| IUPAC-condensed | Gal(β1-3)GlcNAc(β1- | Gal(β1-3)[Fuc(α1-4)]GlcNAc(β1- |
| glycoCT | RES | RES |
| WURCS | WURCS=2.0/2,2,1/[a2122h-1b_1-5_2*NCC/3=O][a2112h-1b_1-5]/1-2/a3-b1 | WURCS=2.0/3,3,2/[a2122h-1b_1-5_2*NCC/3=O][a2112h-1b_1-5][a1221m-1a_1-5]/1-2-3/a4-c1_a3-b1 |
Figure 1. Common terminology and anatomy of a theoretical glycan, (KJ(IH)GF(D(E)(C)B)A. In this figure, we demonstrate some key terminology as well as the three primary uncertainty operators: branch (orange), continuation (blue), and ligand (green). The structures matching these terms are shown in matching colors; those matching both the continuation and ligand are shown in purple. A ligand can typically be removed with one cut. A continuation is a connection from a node to a root that can “continue” past or “bypass” other branch points. The paths from I to G or K to G represent one continuation; to represent both paths, a continuation is necessary because traversing from I to G requires the syntactic “bypass” of the KJ branch.
Glossary of essential terms.
| Term | Definition |
| saccharide unit (SU) | composed of a monosaccharide name, modifications (if any), anomericity (α or β configurations of the glycosidic bond), and the position it is bonded to a given SU. |
| monosaccharide (MS) | a sugar monomer. |
| lowest-carbon-index chain | the lowest carbon index branch corresponding to the non-reducing sugar connected to the lowest reducing-end carbon. |
| branch | any right branch, pictorially “right” of the reducing MS ( |
| reducing and non-reducing ends | these are the MSs that appear “first” (closest to the glycoconjugate or first added in the synthesis) and “last” (leaves or terminal MS, those farthest from the “first” MS within a branch and have no linkage to a non-reducing MS). Typically, there is one reducing end and there are often multiple non-reducing ends. |
| reducing MS | closer to the first MS or the “reducing-end”. |
| non-reducing MS | farther from the first MS and closer to a non-reducing end. |
Original Linear Code rules (Banin et al. [31]).a
| Rule description | Example | |
| saccharide unit (SU) | 1. see one-letter MS names in | |
| 2. the anomer, where an α conformation is denoted as “a,” and β as “b,” follows the one-letter MS name. | Ga, Gb | |
| 3. the carbon number by which the SU is attached follows the anomer. | Ga3, Gb2 | |
| 4. modifications. see modification rules for details, which follow after the carbon number. | see modification rules examples. | |
| open form (OF) | 1. open form notation. If a carbon is in its open-chain form, an “o” is attached to the end. | AbGo |
| stereospecificity and ring structures (SRS) | 1. the less common stereoisomer (ᴅ or ʟ) of an MS is indicated with apostrophes (‘). | D-Glc |
| 2. MSs with uncommon ring structures (e.g., furanose, pyranose) are indicated with a caret (^). | D-Glc | |
| 3. MS that differ in both common stereospecificity and ring structure are indicated with a tilde (~). | D-Glc | |
| modification rules (MR) | 1. the modifications are represented by adding square brackets that include the connecting position of the modification to the SU, followed by the modification symbol ( | β-D-Gal |
| branch rules (BR) | 1. when the non-reducing MSs are identical, the MS linked to the higher index carbon will branch (appear first in the written representation when read right to left, reducing to non-reducing end). | GNb2Ma3(NNa3Ab3GNb2Ma6)Mb4GNb |
| 2. when the non-reducing MSs are different, the less frequent non-reducing MS will branch (MS frequency | Ab3ANb4(NNa3)Ab4Gb | |
| repetition rules (RR) | 1. repeating units are expressed inside parentheses, with an ‘n’ representing the number of repeats. | cellulose, which is a polymer of ᴅ-glucose residues joined by β-1,4 linkages, is represented as {nGb4} |
| 2. if not the non-reducing end, the head of a repeated motif is expressed with two dashes “ - - ” | {nGa6Ga4(-Ab3-)Ub2Ha3Ha3Ha3} | |
| 3. if not the reducing end, the tail of a cyclic motif is expressed using the letter “c”. | nGa6Ga4(-Ab3-)Ub2Ha3H | |
| glycoconjugate rules (GR) | 1. amino acid sequences are written after ‘;’. Lipid moieties are written after ‘:’. Other glycosides are written after ‘#’. | Ga;NY-S-C. |
| uncertainty rules (UR) | 1. α or β linkage unknown, or connection position unknown: ? | AN?3G |
| 2. both linkage and connection position unknown: ?? | AN??G | |
| 3. an entire SU unknown: * | ANb3*A | |
| 4. when two possibilities are given for the identity of an SU element, use “/” | ANb3/4 | |
| 5. when two options are given for the identity of a complete SU, use “//” | Ab4//Ga2Aa3 represents Ab4Aa3 or Ga2Aa3 | |
| 6. for glycan fragments, use an index number + ‘%’ as a variable for the fragment, and a ‘|’ to separate the fragment from the core. | NNa6=1%|1%Ab4GNb2Ma3(1%Ab4GNb2Ma6)Mb4Gb denotes that Ab4GNb2Ma3(Ab4GNb2Ma6)Mb4Gb is the core, and that the linkage of the fragment NNa6 to the core is uncertain. % means uncertain, 1 is the index referring to the uncertain MS. | |
a“(#)” - Rules deprecated in LiCoRR.
Common monosaccharides and their Linear Codes (adapted from [31]). We have added NG as it has become a clearly important monosaccharide excluded from the original list. Full monosaccharide descriptions are recorded in IUPAC [18]; all terms can be found at https://www.qmul.ac.uk/sbcs/iupac/2carb/38.html.
| Monosaccharidesa | Linear Code | IUPAC |
| ᴅ-glucose | G | Glc |
| ᴅ-galactose | A | Gal |
| GN | GlcNAc | |
| AN | GalNAc | |
| ᴅ-mannose | M | Man |
| NN | Neu5Ac | |
| * | NG | Neu5Gc |
| neuraminic acid | N | Neu |
| 2-keto-3-deoxynononic acid | K | KDNc |
| 3-deoxy-ᴅ-manno-2 octulopyranosylonic acid | W | Kdo |
| ᴅ-galacturonic acid | L | GalA |
| ʟ-iduronic acid | I | ᴅ-IdoA |
| ʟ-rhamnose | H | Rha |
| ʟ-fucose | F | Fuc |
| ᴅ-xylose | X | Xyl |
| ᴅ-ribose | B | Rib |
| ʟ-arabinofuranose | R | Ara |
| ᴅ-glucuronic acid | U | GlcA |
| ᴅ-allose | O | All |
| ᴅ-apiose | P | ᴅ-Api |
| ᴅ-fructofuranose | E | Fru |
| *ascaryloseb | C | Asc |
| *ribitolb | T | Rib-ol (Rbo) |
aAll the monosaccharides are in their pyranose form unless otherwise noted. bAsterisk (“*”) represents an update from the original table. cKDN: 3-deoxy-ᴅ-glycero-ᴅ-galacto-nonulosonic acid. Kdn: 3-deoxy-ᴅ-glycero-ᴅ-galacto-nonulosonic acid.
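The saccharide unit rules (one-letter or two-letter MS name, then anomer, then attachment carbon) together with the code table above are enough to sketch a translation from Linear Code to an IUPAC-condensed string. The following is a minimal illustration, not part of LiCoRR itself: the names `LC_TO_IUPAC` and `to_iupac_condensed` are hypothetical, only a subset of the table is included, β is written as `b`, and the anomeric carbon is assumed to be 1 (which is why the sialic acids NN and NG are deliberately left out).

```python
import re

# Subset of the Linear Code -> IUPAC table above (hypothetical name).
# Only MSs whose anomeric carbon is C1 are listed; NN/NG link via C2.
LC_TO_IUPAC = {
    "GN": "GlcNAc", "AN": "GalNAc",
    "G": "Glc", "A": "Gal", "M": "Man", "F": "Fuc", "X": "Xyl",
}

# Longest codes first so "GN" is not read as "G" followed by garbage.
_CODES = sorted(LC_TO_IUPAC, key=len, reverse=True)
_SU = re.compile(
    r"(?P<ms>" + "|".join(_CODES) + r")"
    r"(?:(?P<anomer>[ab?])(?P<pos>[\d?])?)?"
)


def to_iupac_condensed(lc: str) -> str:
    """Translate a Linear Code string into IUPAC-condensed notation."""
    out, i = [], 0
    while i < len(lc):
        ch = lc[i]
        if ch in "()":
            # Linear Code branches in (...) become [...] in IUPAC-condensed.
            out.append("[" if ch == "(" else "]")
            i += 1
            continue
        m = _SU.match(lc, i)
        if m is None:
            raise ValueError(f"unsupported Linear Code at position {i}: {lc[i:]!r}")
        name = LC_TO_IUPAC[m["ms"]]
        anomer, pos = m["anomer"], m["pos"]
        if anomer is None:
            out.append(name)                       # bare MS, e.g. "GN"
        elif pos is None:
            out.append(f"{name}({anomer}1-")       # open reducing end
        else:
            out.append(f"{name}({anomer}1-{pos})")
        i = m.end()
    return "".join(out)


print(to_iupac_condensed("Ab3GNb"))
print(to_iupac_condensed("Ab3(Fa4)GNb"))
```

Applied to the reactant and product of the rule Ab3GNb → Ab3(Fa4)GNb, this yields the same shapes as the IUPAC-condensed row of the first table, with `b` and `a` standing in for β and α.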
Common modifications and their Linear Code (from [31]).
| Modification type | Linear Code | IUPAC |
| deacetylated | Q | N |
| phosphoethanolamine | PE | Pe |
| inositol | IN | In |
| methyl | ME | Me |
| N | NAc | |
| T | Ac | |
| phosphate | P | P |
| phosphocholine | PC | Pc |
| pyruvate | PYR | Pyr |
| sulfate | S | S |
| sulfide | SH | Sh |
| aminoethylphosphonate | EP | Ep |
| *deoxya | D | d |
| *carboxylic acida | CA | -oic |
| *aminea | A | -amine |
| *amidea | AO | -amide |
| *ketonea | K | -one |
aAsterisk (“*”) represents an update from the original table.
Figure 2. Monosaccharide reachability analysis. (A) Clusters contain monosaccharides with highly similar stereochemistry (>80%). (B) The maximum common substructure (MCS) associated with each cluster. (C) An example to illustrate the modifications needed to reach one monosaccharide from another, as identified by the complete monosaccharide reachability network (Table S6, Supporting Information File 1). (D) The monosaccharide reachability network, showing only connectivity for the least number of modifications needed, differentiated by color as stated in the legend, between monosaccharides (circle) and clusters (diamond). Additionally, the node size denotes the number of different possible paths by which each node can be reached. Please note that each edge is not a predicted or proposed feasible reaction. Edges denote functional groups that can be added or removed from one monosaccharide to represent another.
The difference between “ _ ”, “ … ” and “ | ” with illustrations. These symbols were proposed by Krambeck et al. [10]. The initial names are ligand (“ ... ”), continuation (“ _ ”), and possible branch (“ | ”). Each uncertainty operator in the last four example columns can be replaced by the substring in red to achieve the behavior described in the column header. For a more comprehensive look at the usage of these uncertainty operators, see Supporting Information File 1, Table S1 for a manual collection of matches, and Table S4 (Supporting Information File 1) for an automated collection of matches.a
| Add a whole new branch | Initiating branch | Extending lowest-carbon-index chain | Initiating nested branch | |||
| Symbols | Syntax | Meaning | ||||
| any string where every ‘(‘ has a matching ‘)’. Includes the empty string. | chain bypassing a branch to reach a reducing MS; A continuation cannot necessarily be removed by splitting one linkage (can contain branches) | B_A | B(C_A | B_A | E(D_A | |
| any string with all matching parentheses. Includes the empty string. | chain to a reducing MS; A ligand can typically be removed by splitting one linkage (can contain branches) | B...A | B...A | |||
| ‘)’ or ‘(...)’ or ‘)(...)’. Or an empty string. | possible branch point. | B+A | B(C+A | |||
aA, B, C, D, E are abstract monosaccharides.
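The syntax column above can be read as three small predicates over the substring an operator is allowed to stand for. The sketch below is one possible reading, not a reference implementation: the function names are hypothetical, a continuation is taken to mean "every '(' is closed, but a stray ')' is allowed", a ligand is taken to mean "fully balanced", and the "..." inside the possible-branch patterns is taken to be a ligand.

```python
def unmatched(s: str) -> tuple:
    """Return (number of unclosed '(', number of unopened ')') in s."""
    open_depth = extra_close = 0
    for ch in s:
        if ch == "(":
            open_depth += 1
        elif ch == ")":
            if open_depth:
                open_depth -= 1
            else:
                extra_close += 1
    return open_depth, extra_close


def is_ligand(s: str) -> bool:
    """'...': all parentheses matched (empty string included)."""
    return unmatched(s) == (0, 0)


def is_continuation(s: str) -> bool:
    """'_': every '(' has a matching ')'; a leading branch may be closed."""
    return unmatched(s)[0] == 0


def is_possible_branch(s: str) -> bool:
    """Possible branch point: '', ')', '(...)' or ')(...)'."""
    if s in ("", ")"):
        return True
    if s.startswith(")(") and s.endswith(")"):
        return is_ligand(s[2:-1])
    if s.startswith("(") and s.endswith(")"):
        return is_ligand(s[1:-1])
    return False


# In the theoretical glycan of Figure 1, the text between I and G is "H)".
print(is_continuation("H)"), is_ligand("H)"))
```

Under this reading every ligand is also a continuation, but not the reverse, which is consistent with the table's note that a continuation cannot necessarily be removed by splitting one linkage.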
The reaction rule (GN → (Ab4GN with four constraints written in the same cell.
| Enzyme | Reactant | Product | Constraint |
| b4GalT | (GN | (Ab4GN | *...GNb2|Ma3 or |
The reaction rule (GN → (Ab4GN with four constraints written on separate lines.
| Enzyme | Reactant | Product | Constraint |
| b4GalT | (GN | (Ab4GN | *...GNb2|Ma3 |
| b4GalT | (GN | (Ab4GN | *...GNb4|Ma3 |
| b4GalT | (GN | (Ab4GN | *...GNb2|Ma6 |
| b4GalT | (GN | (Ab4GN | *...GNb6|Ma6 |
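Moving from the single-cell form to the one-constraint-per-row form is a mechanical expansion. The sketch below assumes a hypothetical `Rule` record and assumes that `or` appears only at the top level of a constraint; the combined constraint string is reassembled here from the four separate rows above, since the single-cell version is truncated in this record.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    """One row of a reaction rule table (hypothetical record type)."""
    enzyme: str
    reactant: str
    product: str
    constraint: str


def split_or(rule: Rule) -> List[Rule]:
    """Expand a rule whose constraint joins alternatives with 'or'
    into one rule per alternative."""
    alternatives = re.split(r"\s+or\s+", rule.constraint.strip())
    return [rule._replace(constraint=alt) for alt in alternatives]


combined = Rule(
    "b4GalT", "(GN", "(Ab4GN",
    "*...GNb2|Ma3 or *...GNb4|Ma3 or *...GNb2|Ma6 or *...GNb6|Ma6",
)
for row in split_or(combined):
    print(" | ".join(row))
```

Keeping one constraint per row means each row can be evaluated independently with only negation and conjunction, at the cost of repeating the enzyme, reactant, and product.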
Reaction rules from multiple N-glycan biosynthesis models in LiCoRR representation. This table describes select rules from Krambeck et al. [10] in LiCoRR and LiCoRRICE representation. Representations across multiple manuscripts can be found in Linear Code, LiCoRR and LiCoRRICE in Table S5 (Supporting Information File 1).
| Enz. | Substrate | Product | Constraints (LiCoRR) | Constraints (LiCoRRICE) |
| (Ma2Ma | (Ma | !@2Ma3(…Ma6)Ma6 & !Ga3 | nMan(a1-?)>4 & nMan(a1-?)<8 & !Man(a1-2)Man(a1-3)...Man(a1-6) & !Glc(a1-3) | |
| (Ma3(Ma2Ma3(Ma6)Ma6) | (Ma3(Ma3(Ma6)Ma6) | !Ga3 | !Glc(a1-3) | |
| (Ma3(Ma6)Ma6 | (Ma6Ma6 | (GNb2+Ma3 & !Gnbis | !Gal(b1-?) & !GlcNAc(b1-4)...Man(b1-4) & GlcNAc(b1-2)Man(a1-3) | |
| (Ma6Ma6 | (Ma6 | (GNb2+Ma3 & !Gnbis | ||
| GNb4GN | GNb4(Fa6)GN | GNb2+Ma3 & #A=0 & !Gnbis | GlcNAc(b1-2)Man(a1-3)...Man(b1-4) & !GlcNAc(b1-4)...Man(b1-4) & !Fuc(a1-3) | |
| (Ma3(Ma3(Ma6)Ma6)Mb4 | (GNb2Ma3(Ma3(Ma6)Ma6)Mb4 | nMan(a1-?)=4 | ||
| (GNb2+Ma3(Ma6)Mb4 | (GNb2+Ma3(GNb2Ma6)Mb4 | nMan(a1-?)=2 & !GlcNAc(b1-4)...Man(b1-4) & !Fuc(a1-3) & !Gal(b1-?) | ||
| GNb2+Ma3 | GNb2+Ma3(GNb4) | !Ab & !Gnbis | GlcNAc(b1-2)Man(a1-3)...Man(b1-4) & !Gal(b1-?) | |
| (GNb2Ma3 | (GNb2(GNb4)Ma3 | !Gnbis | !Gal(b1-?) & !GlcNAc(b1-4)...Man(b1-4) | |
| (GNb2Ma6 | (GNb2(GNb6)Ma6 | !Gnbis | !Gal(b1-?) & !GlcNAc(b1-4)...Man(b1-4) | |
| (Ab4GN | (GNb3Ab4GN | !@_Ma3+Mb4 | ||
| (GN | (Ab4GN | !@GNb4)(...Ma6)Mb4 | !Gal(b1-3)GlcNAc(b1-?) & !@GlcNAc(b1-4)...Man(b1-4) | |
| (GN | (Ab3GN | !@GNb4)(...Ma6)Mb4 | !Gal(b1-4)GlcNAc(b1-?) & !@GlcNAc(b1-4)...Man(b1-4) | |
Symbols previously used by systems glycobiologists and our recommendations. Rows a–i are the functions implemented by published papers. Rows j–m are the functions prescribed in the original Linear Code rules. (A) Symbols to represent reaction rules across publications utilizing Linear Code. (B) Consensus and recommendation for reaction rule representation going forward.
| (A) | (B) | ||||||||
| Symbol used | OLC [31] | Kra [10] | Spa [13] | Lia [14] | Hou [7] | Consensus adaptation of OLC to reaction rules | LiCoRR | Examples | |
| a | logical negation | ~ | ~ | ~ | ~ | ~ | ! | !Ma | |
| b | and | & | & | & | & | !Ma & Ab3 | |||
| c | or | or | or | or | separate rules, | !Ma or Ab3 | |||
| d | continuation (left parenthesis matched to right parenthesis) | _ | ... | ... | _ | ... or _ | _ | see | |
| e | ligand (all parentheses matched) | ... | ... | ... | ... | see | |||
| f | possible branch point | | | | | | | | | | | + | see | |
| g | reaction site (Code change site) | * | * | * | * | * | @ | !@…Ma2 | |
| h | possible modification | $ | $ | A$GN | |||||
| i | number | # | # | n | nA=0 nA>2 | ||||
| j | divide certainty and uncertainty ( | | | nothing | nothing | |||||
| k | omission of an entire SU ( | * | nothing | * | ANb3*N | ||||
| l | glycosides ( | ;, :, # | ; | nothing | ; for amino acid, | Ga;NY-S-C Gb:C | |||
| m | MS with uncommon stereospecificity and ring structure ( | ~ | nothing | ~ | L-Glcf: G~ | ||||
Abbreviations: OLC (Original Linear Code [31]), Kra (Krambeck et al. [10]), Spa (Spahn [13]), Lia (Liang et al. [14]), Hou (Hou et al. [7]).
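Rows a, f, g, and i of the table amount to a character-for-character substitution when porting a previously published constraint into LiCoRR. The sketch below is illustrative only: the names `LEGACY_TO_LICORR` and `to_licorr` are hypothetical, and it assumes the input is a constraint string in the older reaction-rule style, where `~`, `*`, `|`, and `#` carry only the reaction-rule meanings. It is not safe for original Linear Code structures, where the same characters denote uncommon stereoisomers, unknown SUs, fragment separators, and glycosides (rows j to m).

```python
# Legacy reaction-rule symbol -> LiCoRR symbol, per rows a, f, g, i above.
LEGACY_TO_LICORR = str.maketrans({
    "~": "!",   # logical negation
    "*": "@",   # reaction site
    "|": "+",   # possible branch point
    "#": "n",   # number (count) prefix
})


def to_licorr(constraint: str) -> str:
    """Rewrite a legacy-style constraint string with LiCoRR symbols.

    '&', '_' and '...' are left untouched because LiCoRR keeps them.
    """
    return constraint.translate(LEGACY_TO_LICORR)


print(to_licorr("~*_Ma3|Mb4"))
print(to_licorr("#A=0"))
```

Because the substitution is context-free, any string that mixes structure notation with constraint notation would need a real parser rather than this table lookup.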