Philip V Toukach, Ksenia S Egorova.
Abstract
The Carbohydrate Structure Database (CSDB, http://csdb.glycoscience.ru/) is a free curated repository storing various data on glycans of bacterial, fungal and plant origins. Currently, it maintains close-to-full coverage of bacterial and fungal carbohydrates up to the year 2020. The CSDB web-interface provides free access to the database content and dedicated tools. Still, the number of these tools and the types of the corresponding analyses are limited, whereas the database itself contains data that can be used in a broader scope of analytical studies. In this paper, we present CSDB source data files and a self-contained SQL dump, and exemplify their possible application in glycan-related studies. By using CSDB in an SQL format, the user can obtain, for example, the chain length distribution or charge distribution in a given set of glycans defined according to specific structural, taxonomic, or other parameters, whereas the source text dump files can be imported into any dedicated database whose internal architecture differs from that of CSDB.
Year: 2022 PMID: 35354826 PMCID: PMC8968703 DOI: 10.1038/s41597-022-01186-9
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Source text dump files and SQL files for CSDB are reported in this paper. The CSDB web-interface, associated web tools, and RDF-ized data have been reported elsewhere [15,16,26,37]. Solid arrows represent immanent logic of the database; dashed arrows show inferred data flows upon usage.
Fig. 2 CSDB entity relationship scheme. In each table, the first column gives the field and the second column gives the data type: N, integer (the symbols of the same color correspond to the same indices connected by arrows in different tables); N.N, float; TEXT, text; FORMAT, formatted text; SET, controlled vocabulary term (see the main text); BOOL, boolean switch; ID, identifier in the external database. The color of the table headers reflects the data category: violet = molecular structure; cyan = compound as a whole; red = bibliography; pink = NMR data; green = taxonomy; olive = glycosyltransferases; grey = simulated conformations; yellow = main relations. The table meaning is explained in parentheses, where unclear from the table name (shown in bold). Blue arrows show one-to-many relations between the fields. Links to external resources are shown in italic; denormalized data are greyed.
CSDB text dump file format.
| Field (a) | Explanation |
|---|---|
| ID: | Unique permanent CSDB record ID. Records that have some unresolved problems are marked by star ( |
| TH: | |
| AU: | Comma-separated list of authors. National characters are supported (e.g. |
| TI: | Title of publication (article, chapter, book, symposium thesis). |
| JN: | Journal name without abbreviations, following the NLM standard[ |
| PY: | Publication year. |
| VL: | Volume number. If the volume imprint contains a subvolume/issue, e.g. |
| PG: | Hyphen-separated start and end page numbers. If the end page is unknown, only the start page is provided. For imprints with an article number instead of page numbers, the ID keyword is used, e.g. |
| RL: | References to bibliography: comma-separated list of resource-identifier pairs (resource identifier is given before the colon, reference is after the colon, e.g. |
| EA: | E-mail of the corresponding author, e.g. |
| AD: | Semicolon-separated list of author affiliations (institution, city, country). Each affiliation is listed only once; the order is arbitrary. |
| AB: | Publication abstract. National characters are supported. All chemical structures are either specified in a linear form or replaced with the word |
| ST1: | Chemical structure in CSDB Linear encoding[ |
| ST1ORIG: (b) | Originally published erroneous chemical structure in CSDB Linear encoding (the field is present only if ST1 contains the revised structure). |
| ST2: | Structure type: |
| ST3: | For polymers: polymerization degree preceded by |
| | For oligomers: molecular mass with or without ion type in square brackets, e.g. |
| | Multiple values are separated by commas. |
| SL: | Structure location in the publication, e.g. structure 1, HPLC fraction 2, compound 7a, Fig. 10, p. 456. In addition to the structure itself, tables with the NMR assignment are specified. |
| AG: | Aglycon information (what is attached to the reducing end of the carbohydrate structure and by which position, if known), e.g. |
| MF: | Molecular formula (for mono- and oligosaccharides only). First carbons, then hydrogens, then alphabetically. |
| NMRH: | 1H NMR assignment data. If the spectrum is published but chemical shifts are not picked, the field is: |
| | The proton enumeration within the residues follows the carbon enumeration. If a certain position has no protons, a hyphen ( |
| | The spectrum corresponds to the exact structure specified in ST1. If the structure contains non-stoichiometric moieties (residues preceded by |
| NMRC: | 13C NMR assignment data. If the spectrum is published but chemical shifts are not picked, the field is: The carbon enumeration rules are provided at the CSDB Monomer namespace web page ( |
| | The spectrum corresponds to the exact structure specified in ST1. If the structure contains non-stoichiometric moieties (residues preceded by |
| NMRS: | Solvent, in which NMR experiments were carried out (chemical formula or abbreviation). Mixture components are separated with slash (e.g. |
| NMRT: | Temperature, at which NMR experiments were carried out, in Kelvins. If carbon and proton spectra were recorded at different temperatures, the values are separated with comma and the nuclei are specified in parentheses, e.g. |
| SO: | Semicolon-separated list of biological sources of the structure, without abbreviations. The first word in every source is interpreted as a genus, the second is for species, and all the other words (third and subsequent ones) are combined subspecies, serogroup, and/or strain. |
| | If a taxon was renamed or an organism was reclassified after the data had been published, the newer name is given in curly brackets after the organism, e.g. |
| | If the species name is unknown, while the strain is known, the species name is specified as |
| | In the third part of the taxon name, the following order and abbreviations for space-separated subdivisions are used: subspecies ( |
| | The ranks higher than subspecies must comply with the NCBI Taxonomy[ |
| KD: | Before slash ( |
| HO: | Systematic name (genus and species) of the host organism, in which the microorganism (specified in SO) was found, according to NCBI Taxonomy[ |
| OTI: | Organ, tissue, secret, or other biomaterial, from which the structure was extracted. For microorganisms, organelle or cell part can be specified. Organs of host organisms are also allowed. If the life stage is specified (embryo, culture broth, promastigote, etc.), it is provided in this field after the |
| DSS: | Disease of the host organism associated with the structure or its biological source, or disease of the patient, from which the structure was extracted. Multiple diseases are separated with semicolon; attributes are provided in square brackets when possible. A comma-separated list of attributes can include: |
| NC: | (Trivial) name of the compound. Greek letters, apostrophes and quotes are supported. Multiple names are separated by semicolon. |
| CC: | Comma-separated list of classes and roles of the compound, e.g. |
| MT: | Comma-separated list of methods used to elucidate or process the structure. Not yet fully standardized vocabulary of methods is available at |
| BA: | Biological activity of the compound (free text), including binding and serological data. |
| EI: | Comma-separated list of enzymes, as named by the authors, that release or process the structure, including those from other organisms, excluding the enzymes, deletions of which lead to the described (truncated) structure. |
| BG: | Availability of biosynthetic and genetic data in the publication ( |
| SY: | Availability of data on laboratory or industrial synthesis of the compound ( |
| KW: | Comma-separated list of keywords from the publication. Greek letters, apostrophes and quotes are supported. |
| NT: | Any comments that do not fit the other fields (e.g. errors found in the article, reference to revisions, structural elements that cannot be encoded in ST1 or AG, etc.). |
| 3D: | Availability of 3D structure and conformation data ( |
| RR: | Comma-separated list of related CSDB record IDs: |
| | - other structures compared with the current structure in the publication; |
| | - fragments of this structure; |
| | - similar structures with minor differences from the current structure; |
| | - structures of the same type from the same organism; |
| | - larger structures that include the current structure as their fragment; |
| | - other subunits of the same molecule; |
| | - etc. |
| DB: | Cross-references to other structural databases: comma-separated list of resource-identifier pairs (resource identifier is before the colon, reference is after the colon, e.g. |
| TAX: | Comma-separated list of cross-references to the NCBI Taxonomy database, following the order of organisms in the SO field. The number of values equals the number of taxa in SO, excluding taxon redesignations in curly brackets. The NCBI TaxID of the organism (strain) is provided. If no NCBI TaxID exists for the organism, the NCBI TaxID of the species (or genus) is given in parentheses. If |
| U1: | Last name of the CSDB annotator. |
| U2: | Record submission date (DD.MM.YYYY). |
| U3: | For bacterial papers: RefManID of this paper in a local database of the Carbohydrate Chemistry Lab, Zelinsky Institute of Organic Chemistry, RAS. |
| U4: | For bacterial papers: comma-separated list of RefManIDs of the related papers in a local database of the Carbohydrate Chemistry Lab, Zelinsky Institute of Organic Chemistry, RAS. |
| U5: (b) | If the record was imported from another database (e.g. CarbBank), U5 lists errors found in the original record. |
| U6: (b) | Filename (with extension) of the publication full text in the CSDB local cache. PMIDs (CSDB ID if no PMID is available) are used as names ( |
(a) Mandatory fields are shown in bold.
(b) Optional.
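The field layout above can be read with a few lines of Python. This is only a sketch under stated assumptions: the table does not describe the on-disk layout of the dump, so the code assumes one `TAG: value` pair per line with untagged lines continuing the previous field. The function names (`parse_record`, `parse_pairs`, `parse_sources`, `parse_tax`) are hypothetical helpers, not part of CSDB, and taxon renames in curly brackets inside SO are not handled.

```python
# Field tags copied from the format table (ST1ORIG, U5 and U6 are optional).
KNOWN_TAGS = {
    "ID", "TH", "AU", "TI", "JN", "PY", "VL", "PG", "RL", "EA", "AD", "AB",
    "ST1", "ST1ORIG", "ST2", "ST3", "SL", "AG", "MF", "NMRH", "NMRC", "NMRS",
    "NMRT", "SO", "KD", "HO", "OTI", "DSS", "NC", "CC", "MT", "BA", "EI",
    "BG", "SY", "KW", "NT", "3D", "RR", "DB", "TAX", "U1", "U2", "U3", "U4",
    "U5", "U6",
}


def parse_record(text):
    """Split 'TAG: value' lines into a dict (ASSUMED layout: one field per line)."""
    record, tag = {}, None
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if sep and head.strip() in KNOWN_TAGS:
            tag = head.strip()
            record[tag] = rest.strip()
        elif tag is not None and line.strip():
            # Untagged line: treat it as a continuation of the previous field.
            record[tag] = (record[tag] + " " + line.strip()).strip()
    return record


def parse_pairs(value):
    """RL and DB fields: comma-separated 'resource:reference' pairs."""
    pairs = {}
    for item in value.split(","):
        resource, sep, reference = item.strip().partition(":")
        if sep:
            pairs[resource] = reference
    return pairs


def parse_sources(value):
    """SO field: semicolon-separated sources; genus, species, then the rest."""
    sources = []
    for item in value.split(";"):
        words = item.split()
        if words:
            sources.append({
                "genus": words[0],
                "species": words[1] if len(words) > 1 else "",
                "rest": " ".join(words[2:]),  # subspecies, serogroup, strain
            })
    return sources


def parse_tax(value):
    """TAX field: (taxid, is_exact) pairs; parentheses mark a species/genus fallback."""
    result = []
    for item in value.split(","):
        item = item.strip()
        if item:
            fallback = item.startswith("(") and item.endswith(")")
            taxid = item.strip("()")
            result.append((int(taxid) if taxid.isdigit() else taxid, not fallback))
    return result


# A few fields of record 4676, laid out in the assumed one-field-per-line form.
sample = """ID: 4676
TH: 1
RL: PMID:10632731, DOI:10.1046/j.1432-1327.2000.01041.x
SO: Proteus penneri 63
TAX: (102862)"""
record = parse_record(sample)
```

Restricting tag detection to the known set keeps values such as `PMID:10632731` on a continuation line from being mistaken for a new field.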
Record ID 4676 in the text dump file format.
| Field | Content |
|---|---|
| ID: | 4676 |
| TH: | 1 |
| AU: | Shashkov AS,Kondakova AN,Senchenkova SN,Zych K,Toukach FV,Knirel YA,Sidorczyk Z |
| TI: | Structure of a 2-aminoethyl phosphate-containing O-specific polysaccharide of Proteus penneri 63 from a new serogroup O68 |
| JN: | European Journal of Biochemistry |
| PY: | 2000 |
| VL: | 267(2) |
| PG: | 601–605 |
| RL: | PMID:10632731, DOI:10.1046/j.1432-1327.2000.01041.x |
| EA: | knirel@ioc.ac.ru |
| AD: | N.D. Zelinsky Institute of Organic Chemistry, Russian Academy Of Sciences, Moscow, Russia; Institute of Microbiology and Immunology, University of Lodz, Poland |
| AB: | Lipopolysaccharide of Proteus penneri strain 63 was degraded by mild acid to give a high molecular mass O-specific polysaccharide that was isolated by gel-permeation chromatography. Sugar and methylation analyses and NMR spectroscopic studies, including two-dimensional 1H,1H COSY, TOCSY rotating-frame NOE spectroscopy, H-detected 1H,13C and 1H,31P heteronuclear multiple-quantum coherence (HMQC), and 1H,13C HMQC-TOCSY experiments, demonstrated the following structure of the polysaccharide: /structure/ where FucNAc is 2-acetamido-2,6-dideoxygalactose and PEtn is 2-aminoethyl phosphate. The polysaccharide studied shares some structural features, such as the presence of D-GlcNAc6PEtn and an α-L-FucNAc-(1→3)-D-GlcNAc disaccharide, with other Proteus O-specific polysaccharides. A marked cross-reactivity of P. penneri 63 O-antiserum with P. vulgaris O12 was observed and substantiated by a structural similarity of the O-specific polysaccharides of the two strains. In spite of this, the polysaccharide of P. penneri 63 has the unique structure among Proteus O-antigens, and therefore a new, separate serogroup, O68, is proposed for this strain. |
| ST1: | -6)[Ac(1-2)]aDGlcpN(1-3)[bDGlcp(1-4),Ac(1-2)]aLFucpN(1-3)[xXEtN(1-P-6),Ac(1-2)]bDGlcpN(1-2)bDGlcp(1- |
| ST2: | CHEM |
| ST3: | |
| SL: | Fig. 3 (upper panel) |
| AG: | |
| MF: | |
| NMRH: | #2,3,3,2_Ac // #2,3,3_aDGlcpN 5.05 3.98 3.73 3.72 3.95 4.06 // #2,3,4_bDGlcp 4.63 3.45 3.53 3.45 3.42 3.76-3.97 // #2,3,2_Ac // #2,3_aLFucpN 5.00 4.46 4.10 4.08 4.47 1.31 // #2,6,0_xXEtN 4.17 3.32 // #2,6_P // #2,2_Ac // #2_bDGlcpN 4.91 3.91 3.77 3.61 3.62 4.13-4.25 // #_bDGlcp 4.55 3.55 3.55 3.42 3.44 3.73-3.91 // |
| NMRC: | #2,3,3,2_Ac // #2,3,3_aDGlcpN 99.0 54.5 72.4 70.5 72.2 69.4 // #2,3,4_bDGlcp 104.2 75.0 77.5 70.9 76.8 61.9 // #2,3,2_Ac // #2,3_aLFucpN 98.6 50.4 71.4 79.1 68.9 16.7 // #2,6,0_xXEtN 63.2 41.3 // #2,6_P // #2,2_Ac // #2_bDGlcpN 102.2 57.0 78.9 69.5 75.9 66.0 // #_bDGlcp 103.0 81.2 77.4 70.8 76.8 61.6 // |
| NMRS: | D2O |
| NMRT: | 333 |
| SO: | Proteus penneri 63 |
| KD: | bacteria / Proteobacteria |
| HO: | |
| OTI: | |
| DSS: | infection due to Proteus penneri [ICD11:XN7PE] |
| NC: | |
| CC: | O-polysaccharide |
| MT: | NMR-2D, GLC, GPC, methylation, GLC-MS, 1H NMR, 13C NMR, SDS-PAGE, Western blot |
| BA: | |
| EI: | |
| BG: | |
| SY: | |
| KW: | 2-aminoethyl phosphate,lipopolysaccharide,O-antigen,O-serogroup,Proteus penneri |
| NT: | |
| 3D: | |
| RR: | 4138 |
| DB: | GTC:G92605ZX |
| TAX: | (102862) |
| U1: | Khalfina |
| U2: | 01.06.2004 |
| U3: | 3884 PDF |
| U4: | |
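The NMRH and NMRC values in the record above follow a regular pattern: entries separated by `//`, each starting with a `#path_residue` label followed by chemical shifts. The Python sketch below splits such a value into per-residue entries. It is an illustration based only on this one record; `parse_nmr` and `parse_shift` are hypothetical names. The numbers before the underscore are kept as an opaque tuple without interpreting them, and any token that is neither a number nor a range is returned unchanged.

```python
import re

# A shift range such as 3.73-3.91; both a hyphen and an en dash are accepted.
RANGE = re.compile(r"^(\d+(?:\.\d+)?)[-\u2013](\d+(?:\.\d+)?)$")


def parse_shift(token):
    """Return a float, a (low, high) tuple for a range, or the raw token."""
    match = RANGE.match(token)
    if match:
        return (float(match.group(1)), float(match.group(2)))
    try:
        return float(token)
    except ValueError:
        return token  # unknown marker: keep it as text


def parse_nmr(field):
    """Split an NMRH/NMRC value into one dict per residue entry."""
    entries = []
    for chunk in field.split("//"):
        chunk = chunk.strip()
        if not chunk:
            continue
        label, *tokens = chunk.split()
        path, _, residue = label.lstrip("#").partition("_")
        entries.append({
            "path": tuple(int(p) for p in path.split(",") if p),
            "residue": residue,
            "shifts": [parse_shift(t) for t in tokens],
        })
    return entries


# The 13C assignment of record 4676, copied from the NMRC field.
nmrc = (
    "#2,3,3,2_Ac // #2,3,3_aDGlcpN 99.0 54.5 72.4 70.5 72.2 69.4 // "
    "#2,3,4_bDGlcp 104.2 75.0 77.5 70.9 76.8 61.9 // #2,3,2_Ac // "
    "#2,3_aLFucpN 98.6 50.4 71.4 79.1 68.9 16.7 // #2,6,0_xXEtN 63.2 41.3 // "
    "#2,6_P // #2,2_Ac // #2_bDGlcpN 102.2 57.0 78.9 69.5 75.9 66.0 // "
    "#_bDGlcp 103.0 81.2 77.4 70.8 76.8 61.6 //"
)
entries = parse_nmr(nmrc)
```

Entries without shifts, such as the acetyl and phosphate groups here, come back with an empty `shifts` list, so residues with and without assignments can be handled uniformly.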
Fig. 3 Examples of analytical studies carried out earlier directly on CSDB and Glycosciences.de content. (a) Size distribution of carbohydrate sequence units, (b) branching index distribution, and (c) mean charge density distribution in two taxonomic domains. (d) Glycosidic linkage distribution in bacterial oligomeric glycans. Reproduced with permission from [36].
Fig. 4 Distribution of structures in CSDB according to their antennarity (a) and net charge (b) in the organisms from the three kingdoms represented in CSDB. Prokaryotes and fungi have complete coverage of published carbohydrate structures (up to the year 2020), while plants are covered up to 1997 only.
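A distribution of the kind shown in Fig. 4b reduces to a grouped count once the data sit in SQL tables. The Python sketch below shows the shape of such a query against an in-memory SQLite database. Everything specific in it is an assumption made for illustration: the table `glycan`, its columns `kingdom` and `net_charge`, and the four rows are invented placeholders, not the CSDB schema (see Fig. 2) and not CSDB data.

```python
import sqlite3


def charge_distribution(conn):
    """Count structures per (kingdom, net charge) in the HYPOTHETICAL glycan table."""
    query = (
        "SELECT kingdom, net_charge, COUNT(*) "
        "FROM glycan "
        "GROUP BY kingdom, net_charge "
        "ORDER BY kingdom, net_charge"
    )
    return conn.execute(query).fetchall()


# Toy stand-in for a loaded dump: schema and rows are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE glycan (id INTEGER PRIMARY KEY, kingdom TEXT, net_charge INTEGER)"
)
conn.executemany(
    "INSERT INTO glycan (kingdom, net_charge) VALUES (?, ?)",
    [("bacteria", 0), ("bacteria", -1), ("bacteria", -1), ("fungi", 0)],
)
distribution = charge_distribution(conn)
```

Against the real dump, the same pattern would need the actual table and field names and, most likely, joins between the structure and taxonomy tables.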
| Measurement(s) | scientific literature on natural glycans |
| Technology Type(s) | annotation of publications |
| Sample Characteristic - Organism | Bacteria • Archaea • Viridiplantae • Fungi |