
Sharing of worldwide distributed carbohydrate-related digital resources: online connection of the Bacterial Carbohydrate Structure DataBase and GLYCOSCIENCES.de.

Philip Toukach, Hiren J Joshi, René Ranzinger, Yuri Knirel, Claus-W von der Lieth.

Abstract

Functional glycomics, the scientific attempt to identify and assign functions to all glycan molecules synthesized by an organism, is an emerging field of science. In recent years, several databases have been started, all aiming to support deciphering the biological function of carbohydrates. However, diverse encoding and storage schemes are in use amongst these databases, significantly hampering the interchange of data. We describe the mutual online access between the Bacterial Carbohydrate Structure DataBase (BCSDB) and the GLYCOSCIENCES.de portal, the first reported attempt at a structure-based direct interconnection of two glyco-related databases. In this approach, users have to learn only one interface, always have access to the latest data of both services, and have the results of both searches presented in a consistent way. Establishing this connection helped to find shortcomings and inconsistencies in database design and functionality related to the underlying data concepts and structural representations. For the maintenance of the databases, duplication of work can easily be avoided, and the connection will hopefully lead to better worldwide acceptance of both services within the community of glycoscientists. BCSDB is available at http://www.glyco.ac.ru/bcsdb/ and the GLYCOSCIENCES.de portal at http://www.glycosciences.de/.

Year: 2007 PMID: 17202164 PMCID: PMC1899093 DOI: 10.1093/nar/gkl883

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Functional glycomics is an emerging field of science that aims to create a cell-by-cell catalogue of glycosyltransferase expression and detected glycan structures in relation to health and disease (1–3). Analysis of glycans has proven difficult in the past due to their structural complexity. However, modern analytical methods such as mass spectrometry and NMR now make it possible to elucidate most structural details at the concentration levels required for glycomics (4,5). Several national and international initiatives aiming to decipher the biological function of carbohydrates have emerged in recent years (6–8). In a similar fashion to the finished Human Genome Project, which determined the sequences of the chemical base pairs that make up human DNA, most of these glycomics projects intend to make their data freely accessible under an open access philosophy. Unfortunately, the exchange of data between different glyco-related databases is seriously hampered by the lack of generally accepted digital exchange formats and standardized structural and biological descriptions (9). As in the genomics and proteomics fields, a description of the glycan structures would be an appropriate way to establish an efficient connection of glyco-related information resources. However, glycan sequences cannot be described by a simple linear one-letter code, as each pair of monosaccharides can be linked in several ways and branched structures can be formed. The GLYCOSCIENCES.de portal (7) demonstrates that data originating from various resources can be efficiently integrated using a linear notation for unique description of carbohydrate sequences (LINUCS) (10). The extended alphanumeric IUPAC description and the glycosidic linkage information are used to build up a hierarchy of the various branches, starting from the reducing end of the oligosaccharide chain, which is then converted into a linear representation.
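The idea of ordering the branches and then flattening the hierarchy into a single string can be sketched in a few lines of Python. This is a simplified illustration and not the LINUCS specification: the bracket syntax only imitates the published notation, and sorting the branches by their linkage label is an assumption made here to obtain an order that does not depend on how the structure was entered.

```python
def linearize(residue, linkage=""):
    """Flatten a residue tree into one string, root (reducing end) first.

    residue is a pair (name, branches), where branches is a list of
    (linkage label, child residue) pairs.
    """
    name, branches = residue
    # Assumed ordering rule: sort branches by linkage label so that the
    # order in which they were entered does not affect the result.
    ordered = sorted(branches, key=lambda branch: branch[0])
    inner = "".join(linearize(child, link) for link, child in ordered)
    return "[" + linkage + "][" + name + "]{" + inner + "}"


# The same branched fragment, entered with its two branches in either order.
MAN = ("a-D-Manp", [])
tree_a = ("b-D-Manp", [("(3+1)", MAN), ("(6+1)", MAN)])
tree_b = ("b-D-Manp", [("(6+1)", MAN), ("(3+1)", MAN)])

print(linearize(tree_a))
print(linearize(tree_a) == linearize(tree_b))
```

Because both input orders yield the same string, the string can serve as a lookup key, which is the property a unique linear notation has to provide.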
However, other larger projects use different ways to encode glycan structures. The commercially available GlycoSuiteDB (11) uses the so-called condensed form of the IUPAC description to create a linear representation, where four rules are applied to obtain a unique linear code. The ‘Glycan Database’ of the US Consortium for Functional Glycomics (6) uses the so-called Linear Code™ (12), which represents saccharide units and linkages with one or two characters each. The ordering of glycan branches is established using a special lookup table in which the hierarchy of monosaccharide structures is defined. The KEGG Carbohydrate Matcher (KCaM) (8,13) uses a graph representation based on a connection table to encode carbohydrate structures, where monosaccharides are represented by nodes and glycosidic bonds by edges. GLYCOSCIENCES.de, GlycoSuite, CFG Glycan Database and KEGG-Glycan concentrate on glycan structures found in mammalian species. In contrast, the mission of the Russian Bacterial Carbohydrate Structure DataBase (BCSDB) [(14), for URLs see Appendix] is to provide all published glycan structures found in bacteria. Since the monosaccharide namespace as well as the types of linkages found in bacterial polysaccharides differ considerably from those found in mammals, BCSDB uses an internal representation of glycans that diverges from those used to describe mammalian structures. Looking at the various carbohydrate databases accessible through the Internet, it is obvious that diverse ways to encode and store complex carbohydrates are in use. They all seem to work satisfactorily for the purpose for which they were designed. However, users who would like to access all publicly available glyco-related data spread over many databases not only have to cope with varying graphical and non-graphical interfaces for the input of glycan structures, but must also be aware that the definitions of building blocks and topologies may differ.
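The connection-table idea mentioned for KCaM can be illustrated with a minimal Python sketch. The class, field names and linkage labels below are hypothetical and do not reproduce the actual KCaM or KEGG file format; the only point taken from the text is that monosaccharides are nodes and glycosidic bonds are edges.

```python
from dataclasses import dataclass, field


@dataclass
class Glycan:
    """Toy connection table: residues are nodes, glycosidic bonds are edges."""
    residues: dict = field(default_factory=dict)  # node id -> residue name
    bonds: list = field(default_factory=list)     # (child id, parent id, linkage)

    def add_residue(self, node_id, name):
        self.residues[node_id] = name

    def add_bond(self, child, parent, linkage):
        # linkage is a free-text label such as "a1-3" (illustrative only)
        if child not in self.residues or parent not in self.residues:
            raise KeyError("both residues must exist before they are linked")
        self.bonds.append((child, parent, linkage))

    def children(self, node_id):
        """Return (child id, linkage) pairs attached to the given residue."""
        return [(c, link) for c, p, link in self.bonds if p == node_id]

    def is_branched(self):
        """True if any residue carries more than one substituent residue."""
        return any(len(self.children(n)) > 1 for n in self.residues)


# A branched fragment: two mannose residues attached to one core mannose.
core = Glycan()
core.add_residue(1, "b-D-Manp")
core.add_residue(2, "a-D-Manp")
core.add_residue(3, "a-D-Manp")
core.add_bond(2, 1, "a1-3")
core.add_bond(3, 1, "a1-6")

print(core.children(1))
print(core.is_branched())
```

Unlike a linear code, a table of this kind needs no branch-ordering rule, since the edges can be listed in any order; the price is that two tables must be compared as graphs rather than as strings.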
Each database has developed its own set of rules to handle problematic encoding situations such as the treatment of monovalent substituents, phosphates, sulphates, repeat units, unknown linkages and other uncertain structural features of glycans. It is, of course, an attractive vision [expressed during the Joint Meeting of the Japanese and American Consortia for Glycomics (15)] to have a single user interface that provides access to all relevant world-wide distributed resources without any technical or administrative barrier. A prerequisite for an efficient exchange of data is agreement on a generally accepted exchange format as well as a common application programming interface. Consequently, several proposals for an XML-based description of glycan structures have already been published (16,17). To avoid any further confusion about XML descriptions of glycans, the seven larger initiatives in this field [CFG, BCSDB, GLYCOSCIENCES.de, EUROCarbDB, KEGG, HGPI and CCRC (for abbreviations see Appendix)] agreed to further develop the XML description for the encoding of glycan structures on the basis of the already existing GLYcan Data Exchange (GLYDE) (17). The ongoing discussion is open to all interested scientists and takes place on the forum pages of the EUROCarbDB project. Concerning the technical realization of the online connection between existing databases, the Simple Object Access Protocol (SOAP) now appears to be the broadly accepted procedure for automated communication between web applications. Being designed for communication via the Internet, it is also well suited to the exchange of glycan-related data between distributed computers. Taken together, the field seems to have matured to the point where it is feasible to establish an online connection of distributed databases, at least between the larger established projects.

BACTERIAL CARBOHYDRATE STRUCTURE DATABASE

The Bacterial Carbohydrate Structure DataBase (BCSDB) [(14), for URLs see Appendix] is a database containing data on natural carbohydrates with known structure. In addition to the structure and bibliography, each record in the BCSDB contains the abstract of the publication, data on the carbohydrate source, methods of structure elucidation, information on the availability of spectral data and assignment of NMR spectra when available, data on conformation, biological activity, chemical and enzymatic synthesis, biosynthesis, genetics and other related data. The search criteria can be fragment(s) of the structure; fragment(s) of the NMR spectrum; and indexed tags, including microorganism, bibliography and keywords. Currently, the BCSDB contains ∼8200 records on bacterial carbohydrates, including the corresponding part of CarbBank (18) (∼3500 records on structures reported before 1995). This coverage is approaching the total number of bacterial carbohydrate structures ever reported. Data from both literature and CarbBank have been carefully checked for consistency before the upload, and corrected when necessary. The BCSDB interface includes the web-based user part, web-based administrator part and programming gateways for the automated data interchange. The BCSDB is available on the Internet for free usage and validated user data submission.

GLYCOSCIENCES.de

The GLYCOSCIENCES.de portal (7) is an attempt to link glycan-related data originating from various resources through a unique structural description. The LINUCS (LInear Notation for Unique description of Carbohydrate Sequences) (10) notation is used to uniquely encode fully characterized glycans. Currently, the GLYCOSCIENCES.de portal provides access to ∼24 000 different entries with nearly 14 000 different carbohydrate moieties. These structures come from a number of sources, including the former CarbBank and the SugaBase project (19), automatic extraction from the Protein Data Bank (PDB) (20), and the curation of new entries. The structure-oriented approach to the database allows data related to a single glycan, but originating from various sources (e.g. experimental NMR spectra, theoretically calculated fragment ions for mass spectra interpretation, or experimental or simulated 3D structures), to be easily linked and accessed using a single database query. To meet the varying needs of specific research questions, the GLYCOSCIENCES.de portal provides several structure-oriented options to recall glycan-related data. Substructure searches are the most frequently used way to look for glycan structures. The retrieval of glycans matching an exact structure is the most traditional way to access a database. The motif search retrieves all entries that possess named substructures such as LewisX, blood group H antigen or GM3. All glycan-related scientific data of the GLYCOSCIENCES.de portal are freely accessible via the Internet following the open access philosophy: ‘free availability and unrestricted use’.

WEB-SERVICES

The SOAP-based web-services are available on the websites of the two projects and are documented in the form of WSDL descriptions (for URLs see Appendix), which provide a platform-independent formalization of the server-side features. WSDL files can be easily integrated into existing code by using features of various SOAP libraries, which allow transparent work with the SOAP interface under Perl, PHP, Java, etc. Additionally, sample PHP clients are available.
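To illustrate what a SOAP exchange of this kind looks like on the wire, the following Python sketch builds a SOAP 1.1 request envelope with the standard library and reads values back out of an XML message. The operation name `searchByAuthor`, its parameters and the service namespace are hypothetical placeholders; the real operation names and message layouts are defined in the WSDL files of the two projects and are not reproduced here.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:glyco-service"                # hypothetical service namespace


def build_request(operation, params):
    """Return a SOAP 1.1 request envelope as UTF-8 encoded bytes."""
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    call = ET.SubElement(body, "{%s}%s" % (SERVICE_NS, operation))
    for key, value in params.items():
        ET.SubElement(call, key).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def read_items(message_xml, tag):
    """Collect the text of every element named `tag`, ignoring namespaces."""
    root = ET.fromstring(message_xml)
    return [el.text for el in root.iter() if el.tag.split("}")[-1] == tag]


# Hypothetical operation and parameter names, mirroring the query of Example 1.
request = build_request("searchByAuthor", {"author": "Brade", "year": 2002})
print(request.decode("utf-8"))
```

In practice a SOAP library generates such envelopes from the WSDL description and posts them to the service endpoint, so client code rarely has to assemble them by hand; the sketch only shows what the library hides.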

DATA TRANSFER FORMATS

GLYcan Data Exchange (GLYDE) version 1.2 (17) was chosen as the structure exchange format. It supports almost all known peculiarities of carbohydrate structures, such as uncertainties in configuration and ring sizes, various combinations of repeating and non-repeating parts, non-carbohydrate linkers, cyclic structures, etc. GLYDE uses a tree-based approach to structure description. Within this approach the tree root is the reducing end or the rightmost residue in the repeating unit, while all substituents are the ‘children’ of the residue they are attached to. Configurations, ring sizes and other related information are stored as attributes of the residue. The syntax of GLYDE is XML. To transfer the bibliographic information, two approaches are used: raw data (as an array of strings corresponding to authors, title terms, journal name, etc.) or PubMed XML. BCSDB supports both formats, while GLYCOSCIENCES.de currently supports only the former. The former format is simpler to implement, but the latter provides more standardization. PubMed XML encodes the bibliographic information using a strictly defined set of rules. More information is available at the NCBI PubMed XML tagged data homepage. A well-known identifier for an organism is the TaxID provided by the NCBI taxonomy database. Both databases provide a search mechanism that uses the NCBI TaxID to identify the microorganism. However, the ranking of TaxID is limited to species; thus, no possibility to cross-search for particular strains/serogroups is provided. As this detailed ranking is significant mainly for bacteria, the capability to search below the species level is only supported on the BCSDB side of the connection. TaxIDs are stored in the GLYCOSCIENCES.de database together with the structures, while BCSDB generates TaxIDs from the genus and species names, making use of an NCBI web service.
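The tree-based description at the start of this section can be sketched with Python's standard XML library. The element and attribute names used below (`residue`, `name`, `anomer`, `ring`, `linkage`) are invented for illustration and are not the GLYDE 1.2 schema; the sketch only mirrors the stated principles that the rightmost residue is the root, that each substituent is a child of the residue it is attached to, and that configuration and ring size are stored as attributes.

```python
import xml.etree.ElementTree as ET


def residue(parent, name, anomer, ring, linkage=None):
    """Create a <residue> element; structural details are stored as attributes."""
    attrs = {"name": name, "anomer": anomer, "ring": ring}
    if linkage is not None:
        attrs["linkage"] = linkage  # bond to the parent residue
    if parent is None:
        return ET.Element("residue", attrs)  # tree root
    return ET.SubElement(parent, "residue", attrs)


# Trisaccharide fragment of Example 3: a-D-Galp-(1-3)-a-D-Manp-(1-4)-a-L-Rhap.
# The rightmost residue (Rha) is the root; Man is its child, Gal is Man's child.
rha = residue(None, "L-Rha", "a", "p")
man = residue(rha, "D-Man", "a", "p", linkage="1-4")
gal = residue(man, "D-Gal", "a", "p", linkage="1-3")

xml_text = ET.tostring(rha, encoding="unicode")
print(xml_text)
```

A branched structure would simply give one residue several child elements. Features that do not fit a strict parent-child hierarchy are harder to express in such a nesting, which is one motivation for connection-table formats.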

EXAMPLES

Three examples are given to demonstrate the established interconnection of both data collections. Example 1, using the bibliographic search of GLYCOSCIENCES.de, shows all references found in both resources for author ‘Brade’ in year 2002. GLYCOSCIENCES.de has included only two papers, where NMR spectra are reported. BCSDB lists another five papers where the structures of bacterial polysaccharides are described. Example 2 depicts a substructure search containing a specified disaccharide fragment [α-d-Neup5NAc-(2-3)-β-d-Galp] in GLYCOSCIENCES.de. The structure input option implemented in BCSDB (see Example 2a) is used. The data associated with two entries containing the disaccharide fragment are shown in Example 2b. Example 3 demonstrates a substructure search in BCSDB using GLYCOSCIENCES.de to input the trisaccharide fragment α-d-Galp-(1-3)-α-d-Manp-(1-4)-α-l-Rhap.

Example 1

Retrieval request for references of author ‘Brade’ in year 2002 in any journal. The GLYCOSCIENCES.de advanced bibliographic search was used (only results are shown). References from BCSDB contain one structure each.

Example 2a

Substructure search for α-d-Neup5NAc-(2-3)-β-d-Galp in GLYCOSCIENCES.de using the BCSDB input wizard.

Example 2b

Result of querying GLYCOSCIENCES.de for all structures containing the specified disaccharide fragment α-d-Neup5Ac-(2→3)-β-d-Galp. The data associated with two entries are shown.

Example 3

Querying BCSDB for all structures containing the specified trisaccharide fragment α-d-Galp-(1-3)-α-d-Manp-(1-4)-α-l-Rhap. The GLYCOSCIENCES.de substructure input spreadsheet is used. The data associated with BCSDB entry 10147 are additionally shown.

CONCLUSIONS

The capability of web services to make distributed scientific data accessible is clearly demonstrated. To our knowledge, the implemented mutual online access between BCSDB and GLYCOSCIENCES.de is the first reported attempt at a structure-based interconnection of two glyco-related databases. For users the advantages are obvious: they have to learn and use only one interface, always have access to the latest data from both services, and see the results of both searches presented in a consistent way. For database design and functionality, establishing the connection helped to find shortcomings and inconsistencies in both the underlying data concepts and the structural representations. For the maintenance of the databases, duplication of work can easily be avoided. It can be expected that more frequent use of both services will improve the quality of the data. This will hopefully lead to better worldwide acceptance of both services within the community of glycoscientists. Since the exchange of data is accomplished through standard, well-documented XML-based descriptions and SOAP protocols, other interested providers of glyco-related databases can easily be linked, so that a larger network could grow. It can be envisaged that the online connection of thematically related scientific data collections will have a bright future, and not only in the area of glycosciences. One of the main bottlenecks at present is that broadly accepted standard XML exchange formats are often not yet available. It will definitely be a time-consuming task to reach agreement on such standard descriptions within the various communities. With GLYDE 1.2, an XML-based encoding scheme for glycan structures exists that is sufficiently flexible to link the vast majority of structures contained in BCSDB and GLYCOSCIENCES.de. However, GLYDE 1.2 has some shortcomings regarding uncertainties in terminal residues and other fuzzy encodings, which will become more important for glycomics projects.
The current focus of discussion is to base a more flexible encoding on a connection table approach instead of the tree-like structure used in GLYDE 1.2. Recently (September 2006, NIH Meeting ‘Frontier in Glycomics’), the seven larger projects mentioned above agreed to support GLYDE-CT as the main database format for the exchange of glycan structures. The hope is, of course, that only one format will be used by everyone. A less favourable situation would be one in which several exchange formats exist and parsers for each of them must be available for each database.

16 in total

1. GlycoSuiteDB: a curated relational database of glycoprotein glycan structures and their biological sources. 2003 update.

Authors: Catherine A Cooper; Hiren J Joshi; Mathew J Harrison; Marc R Wilkins; Nicolle H Packer
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

2. Data mining the protein data bank: automatic detection and assignment of carbohydrate structures.

Authors: Thomas Lütteke; Martin Frank; Claus-W von der Lieth
Journal: Carbohydr Res Date: 2004-04-02 Impact factor: 2.104

3. The carbohydrate sequence markup language (CabosML): an XML description of carbohydrate structures.

Authors: Norihiro Kikuchi; Akihiko Kameyama; Shuuichi Nakaya; Hiromi Ito; Takashi Sato; Toshihide Shikanai; Yoriko Takahashi; Hisashi Narimatsu
Journal: Bioinformatics Date: 2004-11-25 Impact factor: 6.937

Review 4. Proteomic analysis of glycosylation: structural determination of N- and O-linked glycans by mass spectrometry.

Authors: David J Harvey
Journal: Expert Rev Proteomics Date: 2005-01 Impact factor: 3.940

Review 5. A genetic approach to Mammalian glycan function.

Authors: John B Lowe; Jamey D Marth
Journal: Annu Rev Biochem Date: 2003-03-27 Impact factor: 23.643

6. LINUCS: linear notation for unique description of carbohydrate sequences.

Authors: A Bohne-Lang; E Lang; T Förster; C W von der Lieth
Journal: Carbohydr Res Date: 2001-11-01 Impact factor: 2.104

7. Bioinformatics for glycomics: status, methods, requirements and perspectives.

Authors: Claus-Wilhelm von der Lieth; Andreas Bohne-Lang; Klaus Karl Lohmann; Martin Frank
Journal: Brief Bioinform Date: 2004-06 Impact factor: 11.622

8. A 1H NMR database computer program for the analysis of the primary structure of complex carbohydrates.

Authors: J A van Kuik; K Hård; J F Vliegenthart
Journal: Carbohydr Res Date: 1992-11-04 Impact factor: 2.104

9. The Complex Carbohydrate Structure Database.

Authors: S Doubet; K Bock; D Smith; A Darvill; P Albersheim
Journal: Trends Biochem Sci Date: 1989-12 Impact factor: 13.807

10. KCaM (KEGG Carbohydrate Matcher): a software tool for analyzing the structures of carbohydrate sugar chains.

Authors: Kiyoko F Aoki; Atsuko Yamaguchi; Nobuhisa Ueda; Tatsuya Akutsu; Hiroshi Mamitsuka; Susumu Goto; Minoru Kanehisa
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971
