GlycomeDB--a unified database for carbohydrate structures.

René Ranzinger¹, Stephan Herget, Claus-Wilhelm von der Lieth, Martin Frank.

Substances: Carbohydrates

Year: 2010 PMID: 21045056 PMCID: PMC3013643 DOI: 10.1093/nar/gkq1014

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In a recent NIH whitepaper (1) the lack of a comprehensive, curated carbohydrate structure database was identified as the largest deficit in glycomics and glycobiology research. The Complex Carbohydrate Structure Database (CCSD) (2), initiated in the 1980s, was the largest effort to date to collect carbohydrate structures, mainly through retrospective manual extraction from the literature. The database contained about 50 000 entries when it ceased to be updated in the late 1990s due to a lack of funding. Since then different specialized databases have been developed, which were initially seeded with a subset of the structures contained in the CCSD (3). Subsequently these databases were further extended with carbohydrate structures reflecting the research focus of the group that maintained the database. As a result, different valuable collections of carbohydrate data have emerged over recent years, for example: the Bacterial Carbohydrate Structure Database (BCSDB) (4) that collects all published bacterial carbohydrate structures (including their NMR spectra); the database of the Consortium for Functional Glycomics (CFG) that provides access to primary experimental data like that from glycan microarray screens (5); and the Kyoto Encyclopedia of Genes and Genomes (KEGG) that contains glycan-related biosynthetic pathways (6). Unfortunately each of these databases uses a different ‘sequence format’ for encoding carbohydrate structures, making it difficult to query across all public databases and analyze or compare their content, or simply to find out whether some additional information on a particular carbohydrate structure is available in any of the databases.

GlycomeDB—SCOPE AND IMPLEMENTATION

In 2005, a new initiative was begun to overcome the isolation of the public carbohydrate structure databases and to create a comprehensive index of all available structures with cross-links back to the original databases. To achieve this goal, the structures in the freely available databases were translated, where possible, to the GlycoCT sequence format (7) and stored in a new database, GlycomeDB (8). The integration process is performed incrementally on a weekly basis, updating GlycomeDB with the newest structures available in the associated databases. A Java application called GlycoUpdateDB, backed by a PostgreSQL database, downloads the data from the public databases, reads their sequence notations and translates them to the GlycoCT encoding format. In addition, the taxonomic annotations are standardized semi-automatically based on curated tables that map the (free-text) annotations used in the source databases to NCBI taxonomy IDs [for more details see (8)]. To extract the carbohydrate structures from the Protein Data Bank (PDB), the pdb2linucs tool is used (9). During the integration process automated checks are performed; structures that contain errors are reported to the administrators of the original database. A major challenge during the initial integration process was the lack of a controlled vocabulary for carbohydrate and non-carbohydrate residue names. Even within a single database the same monosaccharide could have different names. In total 12 253 different residue names were extracted from the sequences stored in the original carbohydrate databases. Of these, 5854 were identified as non-carbohydrate residues, mainly aglycons such as amino acids, lipids or other small organic molecules attached to the reducing end of the carbohydrate, and 5330 were identified as monosaccharides and assigned a standardized GlycoCT encoding. The remaining 1069 residue names could not be interpreted so far.
Based on the initial analysis of the namespace used to encode carbohydrate structures in the various databases, a dictionary has been created that contains mappings between the various encoding formats. The dictionary is now used to support the automated update process. If a new residue name appears, it is reported to the database curator, who can then check whether the residue name is valid and add the new residue to the dictionary. Finally, a web interface has been developed (www.glycome-db.org) as a single query point for all open access carbohydrate structure databases (10).
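The dictionary-supported update step described above can be sketched in a few lines of Python. This is a minimal illustration, not the GlycoUpdateDB implementation: the class name, the category constants, the database labels and the example residue strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Categories named in the text: monosaccharides and non-carbohydrate
# residues (aglycons). The constant names are hypothetical.
MONOSACCHARIDE = "monosaccharide"
AGLYCON = "aglycon"


@dataclass
class ResidueDictionary:
    """Maps (source database, residue name) to (category, standardized name)."""
    entries: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)  # unknown names for the curator

    def add(self, database, name, category, standardized):
        self.entries[(database, name)] = (category, standardized)

    def lookup(self, database, name):
        """Return the mapping, or None after queuing the name for review."""
        key = (database, name)
        if key not in self.entries:
            if key not in self.pending:
                self.pending.append(key)
            return None
        return self.entries[key]


# Illustrative entries only; the strings are placeholders, not GlycomeDB data.
rd = ResidueDictionary()
rd.add("CarbBank", "b-D-Glcp", MONOSACCHARIDE, "b-dglc-HEX-1:5")
rd.add("CarbBank", "Ser", AGLYCON, "serine")

hit = rd.lookup("CarbBank", "b-D-Glcp")
miss = rd.lookup("KEGG", "XyzUnknown")  # unknown name, queued for review
```

Keying the dictionary on the pair of source database and residue name reflects the observation that names are not consistent across, or even within, databases; the `pending` list plays the role of the report sent to the curator.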

DATABASE CONTENT

GlycomeDB contains the unified carbohydrate sequences of all publicly accessible databases that contain carbohydrate structures. In total 121 766 original sequences were parsed and integrated. As of August 2010, GlycomeDB stores 35 873 unique carbohydrate sequences (with taxonomic annotations where available), 11 822 of which are fully determined carbohydrates. A carbohydrate structure is defined as ‘fully determined’ if all monosaccharide characteristics (base type, anomer, ring size, substituents, modifications, etc.) and all linkage positions are known. For polysaccharides the number of repeating units needs to be determined as well. An overview of the number of carbohydrate structures contributed by each database is given in Table 1.
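The ‘fully determined’ criterion can be expressed as a simple predicate. The sketch below assumes a deliberately simplified data model (the class and field names are hypothetical and much coarser than GlycoCT), in which an unknown property is represented by `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Residue:
    base_type: Optional[str] = None   # e.g. "glc"; None means unknown
    anomer: Optional[str] = None      # "a" or "b"
    ring_size: Optional[str] = None   # e.g. "pyranose"
    substituents_known: bool = False  # substituents/modifications resolved?


@dataclass
class Linkage:
    parent_position: Optional[int] = None
    child_position: Optional[int] = None


@dataclass
class Structure:
    residues: List[Residue]
    linkages: List[Linkage] = field(default_factory=list)
    is_repeating: bool = False
    repeat_count: Optional[int] = None


def is_fully_determined(s: Structure) -> bool:
    """True if every residue property and linkage position is known."""
    residues_ok = all(
        r.base_type is not None and r.anomer is not None
        and r.ring_size is not None and r.substituents_known
        for r in s.residues
    )
    linkages_ok = all(
        l.parent_position is not None and l.child_position is not None
        for l in s.linkages
    )
    # Repeating structures additionally need a known number of repeats.
    repeat_ok = (not s.is_repeating) or s.repeat_count is not None
    return residues_ok and linkages_ok and repeat_ok


glc = Residue("glc", "b", "pyranose", True)
gal = Residue("gal", "b", "pyranose", True)
complete = Structure([gal, glc], [Linkage(4, 1)])
unknown_link = Structure([gal, glc], [Linkage(None, 1)])
polymer = Structure([glc], [], is_repeating=True)  # repeat count missing
```

Under these assumptions only `complete` passes the check; `unknown_link` fails on the missing linkage position and `polymer` on the missing repeat count.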

Table 1.

External database	Number of sequences in external database	Number of unique GlycoCT sequences	Fully determined carbohydrate sequences	URL
BCSDB (4)	8119	6536 (4149)	1972 (1277)	http://www.glyco.ac.ru/bcsdb3/
CCSD (2)	23 402	14 887 (1544)	7406 (462)	http://www.genome.jp/dbget-bin/www_bfind?carbbank
CFG (5)	8873	6285 (4143)	397 (110)	http://www.functionalglycomics.org/
EUROCarbDB	13 467	13 308 (411)	8924 (139)	http://www.ebi.ac.uk/eurocarb/
Glycobase(Lille) (11)	247	197 (145)	195 (143)	http://glycobase.univ-lille1.fr/base/
GLYCOSCIENCES.de (12)	23 285	15 829 (391)	9225 (36)	http://www.glycosciences.de/
KEGG (6)	10 969	10 160 (6128)	1610 (179)	http://www.genome.jp/kegg/glycan/
PDB (13)	905	733 (0)	708 (0)	http://www.rcsb.org/pdb/

Overview of the number of original unique carbohydrate or glycoconjugate sequences contained in the source databases (encoded in the database-specific format, including the aglycon unit) and the number of unique GlycoCT sequences generated after removing the aglycon and parsing the remaining code. The numbers in brackets denote the number of sequences that are stored exclusively in this database. Currently GlycomeDB contains 35 873 unique carbohydrate sequences and 11 822 fully determined carbohydrate sequences. See text for the criteria of a ‘fully determined sequence’.
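The per-database counts in Table 1, including the bracketed numbers of sequences found exclusively in one database, amount to set arithmetic over unified sequences. The following sketch uses toy data and a hypothetical function name; it only illustrates the counting and does not reproduce the figures in the table.

```python
from collections import Counter


def summarize(db_sequences):
    """db_sequences maps a database name to the set of its unified sequences.

    Returns (per-database (unique, exclusive) counts, total unique sequences).
    """
    # Count in how many databases each sequence occurs.
    occurrences = Counter()
    for sequences in db_sequences.values():
        occurrences.update(set(sequences))

    per_db = {}
    for name, sequences in db_sequences.items():
        unique = set(sequences)
        exclusive = sum(1 for s in unique if occurrences[s] == 1)
        per_db[name] = (len(unique), exclusive)
    return per_db, len(occurrences)


# Toy data: "S1".."S4" stand in for unified sequence strings.
toy = {
    "DB_A": {"S1", "S2", "S3"},
    "DB_B": {"S2", "S3"},
    "DB_C": {"S3", "S4"},
}
per_db, total = summarize(toy)
```

Because sequences are shared between databases, the per-database unique counts do not add up to the overall total, which is why the table reports a separate overall number of unique sequences.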

Data retrieval and presentation

Four major structural query options are implemented in GlycomeDB, namely ‘exact structure search’, ‘substructure search’, ‘similarity search’ and ‘maximum common substructure search’ (10). Structural queries can be entered graphically, either using GlycanBuilder (14) as the default, or using DrawRINGS, developed by a Japanese group at SOKA University, Tokyo (http://rings.t.soka.ac.jp). It is also possible to specify the query structure in one of several machine-readable encoding formats, including the CarbBank format (2), LINUCS (15), LinearCode® (16), BCSDB encoding (4) and Glyde II (http://glycomics.ccrc.uga.edu/core4/informatics-glyde-ii.html). In addition to the exact structure search, which is based on a comparison of ordered GlycoCT encodings (7), it is possible to generate queries with partially unknown information at the monosaccharide level, i.e. unknown anomeric center, ring size, or absolute configuration. It is also possible to restrict the search to specific taxonomic sources, as GlycomeDB consistently applies the NCBI taxonomy to its taxonomic data (17). The various search options can be combined sequentially into a multistep query refinement workflow, which allows very complex queries to be performed. From the GlycomeDB information page for an individual structure (Figure 1), the user can follow hyperlinks to the relevant pages of the external databases, which offer additional information such as literature references, experimental data or 3D structures. Additionally, information about bound aglycons and structural motifs, and a selectable sequence encoding are displayed. For more detailed information about the various aglycons attached to a particular carbohydrate, the user is guided to the original databases by following the link ‘Show remote structure evidences’.
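The idea of a query with partially unknown monosaccharide properties, refined in a second step by taxonomy, can be illustrated as follows. This is a simplified sketch, not the GlycomeDB search code: structures are reduced to an ordered tuple of residues (linkages and branching are ignored), the function and field names are hypothetical, and the entries are invented. The NCBI taxonomy IDs used are 9606 (Homo sapiens), 10090 (Mus musculus) and 562 (Escherichia coli).

```python
from typing import NamedTuple, Optional


class Residue(NamedTuple):
    name: str
    anomer: Optional[str] = None   # None = unknown / wildcard in a query
    ring: Optional[str] = None
    config: Optional[str] = None   # absolute configuration, "D" or "L"


def residue_matches(query: Residue, target: Residue) -> bool:
    if query.name != target.name:
        return False
    # Skip the name (index 0); an unset query attribute matches anything.
    return all(q is None or q == t for q, t in zip(query[1:], target[1:]))


def structure_search(entries, query):
    """Keep entries whose ordered residues match the query position by position."""
    return {
        sid: e for sid, e in entries.items()
        if len(e["residues"]) == len(query)
        and all(residue_matches(q, t) for q, t in zip(query, e["residues"]))
    }


def taxon_filter(entries, taxon_id):
    """Keep entries annotated with the given NCBI taxonomy ID."""
    return {sid: e for sid, e in entries.items() if taxon_id in e["taxa"]}


# Toy entries; the IDs and residues are invented for illustration.
entries = {
    1: {"residues": (Residue("Gal", "b", "p", "D"), Residue("Glc", "b", "p", "D")),
        "taxa": {9606}},
    2: {"residues": (Residue("Gal", "a", "p", "D"), Residue("Glc", "b", "p", "D")),
        "taxa": {9606, 10090}},
    3: {"residues": (Residue("Gal", "a", "p", "D"), Residue("Glc", "a", "p", "D")),
        "taxa": {562}},
}

query = (Residue("Gal"), Residue("Glc", anomer="b"))  # Gal properties left unknown
step1 = structure_search(entries, query)   # structural step
step2 = taxon_filter(step1, 10090)         # refinement by taxonomic source
exact = structure_search(entries, entries[3]["residues"])
```

Each step takes and returns the same kind of mapping, so the steps can be chained in any order, which mirrors the sequential query refinement described above. A query with every attribute set behaves like an exact search in this reduced model.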

Figure 1.

The structure information page of GlycomeDB. The carbohydrate is displayed in CFG-style cartoon representation. Species annotations and hyperlinks to external databases are available.

SUMMARY AND OUTLOOK

GlycomeDB integrates the structural and taxonomic data of all major public carbohydrate databases, as well as carbohydrates contained in the Protein Data Bank, which currently makes it the most comprehensive unified resource for carbohydrate structures worldwide. Hyperlinks to the original source of the data are established, so users can use the GlycomeDB Web-portal to efficiently access relevant additional information that is only available in the original databases. GlycomeDB integrates knowledge from other existing databases; therefore only carbohydrate structures that are stored in one of these databases will be integrated and cross-linked in GlycomeDB. Unfortunately, GlycomeDB cannot provide access to all published structures because, in contrast to proteomics and genomics, glycomics does not yet have an established procedure that requires deposition of new structures in the context of publication. It can therefore be assumed that not all published structures are currently available in a database. However, if a public database is used in the future for the systematic deposition of new structures, these structures should also become available automatically in GlycomeDB. In general, the quality of the data depends on the quality of the referenced databases and their curation processes. Nevertheless, GlycoUpdateDB applies additional validation checks during the integration process in order to improve the quality of the data. The curated database can be downloaded and used freely by interested scientists. It can be assumed that the development of annotation tools for MS and NMR, which require a library of existing carbohydrate structures as reference data, will benefit from the availability of GlycomeDB. Additionally, the data contained in GlycomeDB can facilitate statistical analyses of the ‘glycospace’ of different organisms (18,19).

AVAILABILITY

GlycomeDB can be accessed using a Web-portal (http://www.glycome-db.org/) or the complete database can be downloaded as a compressed zip archive, containing all structures that have been integrated (http://www.glycome-db.org/downloads/). The structures are stored in regular XML files according to the Glyde II specification and can be used by any software that supports this format.
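Reading XML files out of a downloaded zip archive needs only the Python standard library. The sketch below makes no assumption about the Glyde II schema: it parses each XML member and exposes its root element, and the element and attribute names in the demo are placeholders rather than real Glyde II names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_roots(archive):
    """Yield (member name, root element) for every .xml member of a zip archive.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".xml"):
                with zf.open(name) as fh:
                    yield name, ET.parse(fh).getroot()


# Demo on an in-memory archive; the element and attribute names are
# placeholders, not the Glyde II schema.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("structure_1.xml", "<structure id='1'/>")
    zf.writestr("README.txt", "not XML")
buffer.seek(0)

found = [(name, root.tag, root.get("id")) for name, root in iter_xml_roots(buffer)]
```

For a real download, the path of the saved archive would be passed to `iter_xml_roots` instead of the in-memory buffer, and the returned elements would then be interpreted according to the Glyde II specification.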

FUNDING

EU (6th Research Framework Program, RIDS contract number 011952); German Research Foundation (DFG BIB 46 HDdkz 01-01). Funding for open access charge: German Cancer Research Center (DKFZ), Heidelberg, Germany.

Conflict of interest statement. None declared.

16 in total

Review 1. GLYCOSCIENCES.de: an Internet portal to support glycomics and glycobiology research.

Authors: Thomas Lütteke; Andreas Bohne-Lang; Alexander Loss; Thomas Goetz; Martin Frank; Claus-W von der Lieth
Journal: Glycobiology Date: 2005-10-20 Impact factor: 4.313

Review 2. Advancing glycomics: implementation strategies at the consortium for functional glycomics.

Authors: Rahul Raman; Maha Venkataraman; Subu Ramakrishnan; Wei Lang; S Raguram; Ram Sasisekharan
Journal: Glycobiology Date: 2006-02-14 Impact factor: 4.313

Review 3. KEGG as a glycome informatics resource.

Authors: Kosuke Hashimoto; Susumu Goto; Shin Kawano; Kiyoko F Aoki-Kinoshita; Nobuhisa Ueda; Masami Hamajima; Toshisuke Kawasaki; Minoru Kanehisa
Journal: Glycobiology Date: 2005-07-13 Impact factor: 4.313

4. SOACS index: an easy NMR-based query for glycan retrieval.

Authors: Emmanuel Maes; Fanny Bonachera; Gerard Strecker; Yann Guerardel
Journal: Carbohydr Res Date: 2008-11-12 Impact factor: 2.104

5. GlycoCT-a unifying sequence format for carbohydrates.

Authors: S Herget; R Ranzinger; K Maass; C-W V D Lieth
Journal: Carbohydr Res Date: 2008-03-13 Impact factor: 2.104

6. The Complex Carbohydrate Structure Database.

Authors: S Doubet; K Bock; D Smith; A Darvill; P Albersheim
Journal: Trends Biochem Sci Date: 1989-12 Impact factor: 13.807

7. Exploring the structural diversity of mammalian carbohydrates ("glycospace") by statistical databank analysis.

Authors: Daniel B Werz; René Ranzinger; Stephan Herget; Alexander Adibekian; Claus-Wilhelm von der Lieth; Peter H Seeberger
Journal: ACS Chem Biol Date: 2007-10-19 Impact factor: 5.100

8. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data.

Authors: Helen Berman; Kim Henrick; Haruki Nakamura; John L Markley
Journal: Nucleic Acids Res Date: 2006-11-16 Impact factor: 16.971

9. GlycomeDB - integration of open-access carbohydrate structure databases.

Authors: René Ranzinger; Stephan Herget; Thomas Wetter; Claus-Wilhelm von der Lieth
Journal: BMC Bioinformatics Date: 2008-09-19 Impact factor: 3.169

10. The GlycanBuilder: a fast, intuitive and flexible software tool for building and displaying glycan structures.

Authors: Alessio Ceroni; Anne Dell; Stuart M Haslam
Journal: Source Code Biol Med Date: 2007-08-07
