pdb-care (PDB carbohydrate residue check): a program to support annotation of complex carbohydrate structures in PDB files.

Abstract

BACKGROUND: Carbohydrates are involved in a variety of fundamental biological processes and pathological situations. They therefore have a large pharmaceutical and diagnostic potential. Knowledge of the 3D structure of glycans is a prerequisite for a complete understanding of their biological functions. The largest source of biomolecular 3D structures is the Protein Data Bank. However, about 30% of the 1663 PDB entries (version September 2003) that contain carbohydrates have errors in the glycan description. Unfortunately, no software is currently available that compares the 3D information with the reported assignments. It is the aim of this work to fill this gap.
RESULTS: The pdb-care program (http://www.glycosciences.de/tools/pdb-care/) identifies and assigns carbohydrate structures using only the atom types and 3D atom coordinates given in PDB files. The resulting systematic names are looked up in a translation table that lists them alongside the respective PDB residue codes, so that the computed and the reported assignments can be compared and inconsistencies reported. Additionally, the reliability of reported and calculated connectivities for molecules listed within the HETATM records is checked and unusual values are reported.
CONCLUSION: Frequent use of pdb-care will help to improve the quality of carbohydrate data contained in the PDB. Automatic assignment of carbohydrate structures contained in PDB entries will enable the cross-linking of glycobiology resources with genomic and proteomic data collections.

Substances:
Carbohydrates

Year: 2004 PMID: 15180909 PMCID: PMC441419 DOI: 10.1186/1471-2105-5-69

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Carbohydrates are involved in a variety of fundamental biological processes (cellular differentiation, embryonic development, fertilization) and pathological situations (bacterial and viral infections, inflammatory diseases, cancer) [1-3]. They therefore have a large pharmaceutical and diagnostic potential. Protein-carbohydrate interactions are intensively investigated using a variety of experimental methods. Among these, X-ray and NMR measurements provide a detailed 3D picture of the spatial location of the ligand as well as the protein. About 70% of all proteins deposited in sequence databases show potential N-glycosylation sites, which can be identified by the presence of the Asn-X-Ser/Thr sequon [4,5]. For reasons that are still unclear, not all such sequons are glycosylated. It is estimated that more than half of all the proteins in the human body have carbohydrate molecules attached (see Fig. 1a). Unfortunately, extensive analytical procedures are required to determine which of these potential sites are occupied, even if the protein in question is known to be glycosylated. For several reasons, the absence of carbohydrate data in 3D structures obtained by X-ray crystallography does not necessarily mean that a potential glycosylation site is unoccupied. The presence of carbohydrate residues in a 3D structure, however, provides direct and unambiguous evidence that a glycosylation site is occupied. These data have therefore been used intensively to gain deeper insight into the factors that regulate glycan attachment to an N-glycosylation site [6].
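To make the notion of a potential N-glycosylation site concrete, the following minimal Python sketch scans a protein sequence for the Asn-X-Ser/Thr sequon. It assumes the common convention that X may be any residue except proline, which the text above does not state explicitly; the function name and the sequence fragment are hypothetical. A hit marks only a potential site, since not all sequons are glycosylated.

```python
import re

# Asn-X-Ser/Thr; excluding proline at X is an assumption (a widely used convention).
# The lookahead keeps the match one residue wide, so overlapping sequons are all found.
SEQUON = re.compile(r"N(?=[^P][ST])")

def find_sequons(sequence):
    """Return the 1-based positions of Asn residues that start a sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]

# Hypothetical sequence fragment, for illustration only.
example = "MKNGTLLNPSAANVSQ"
print(find_sequons(example))  # [3, 13]
```

The Asn at position 8 is skipped because it is followed by proline.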

Figure 1

Description of carbohydrate structures. (a) Typical PDB entry (PDB code: 1axy) with an attached N-glycan shown in space-filling representation. (b) Ball-and-stick 3D representation of the same N-glycan. To allow a simple comparison of the different representations, the same colour code is used for each residue in Figure 1b-d. (c) LINUCS representation of the same N-glycan structure. (d) IUPAC-like description of the same N-glycan.

Among the 25667 PDB entries (May 2004) [7], 1995 structures were detected which contain a total of 6714 carbohydrate chains. About half of them are covalently attached; the other half belong to non-covalently bound ligands. In a recent study, however, it was found that about 30% of all PDB entries containing carbohydrates have one or several errors in the glycan description, which are mainly due to wrong assignment of saccharide units [8]. The reason for this unacceptably high error rate is obvious. Sequences of complex carbohydrates differ significantly from the simple linear one-letter code used to describe genes and proteins: a) the number of naturally occurring residues is much larger for glycans, b) each pair of monosaccharide residues can be linked in several ways, and c) a residue can be connected to three or four others (branching). Unfortunately, no simple representation of complex carbohydrates exists that is accepted by scientists from all disciplines in the way the one-letter code for amino acid sequences is accepted for proteins. Sugar residues present in the PDB are defined in the so-called HET Group Dictionary [9]. Since a three-letter code is used to uniquely assign carbohydrates, a new residue name is required for each stereochemically different sugar unit and each substitution. This procedure makes the correct assignment of sugar units tedious, complicated and obviously error-prone. Unfortunately, no software is currently available that automatically compares the 3D information contained in PDB entries with the given assignments. It is the focus of pdb-care [10] to fill this gap.

Implementation

The pdb-care [10] program is based on the carbohydrate detection software pdb2linucs [8], which identifies and assigns carbohydrate structures using only the reported atom types and their 3D atom coordinates. The LINUCS notation [11], which is closely related to the IUPAC nomenclature recommendation for carbohydrates [12], is used to normalise complex carbohydrate structures (Fig. 1c and 1d). To compare the detected carbohydrate structures in LINUCS notation with the residue assignments reported in the PDB HET Group Dictionary [9], a translation table in XML format was created that lists both descriptions side by side. Three types of residues are included: monosaccharides, oligosaccharides and combined residues consisting of a carbohydrate moiety and a non-carbohydrate part, e.g. BOG standing for octyl-β-D-Glcp (see Table 1 for definitions of the PDB residue names mentioned in this article). The translation table currently contains 141 monosaccharides, 31 oligosaccharides and 77 combined residues and is still growing. The pdb-care [10] protocol reports the types of problems, inconsistencies and errors detected. The messages are classified as info, which describes the type of checks performed, as warnings when non-resolvable discrepancies are detected, and as errors when obviously wrong assignments have been found.

Table 1

PDB carbohydrate residues (examples): Definitions of carbohydrate residues used in the Protein Data Bank. PDB residue names are defined using a three-letter encoding. There are more than 200 different carbohydrate residue names used in PDB entries. This table lists those mentioned in this article and some further examples in alphabetic order.

Name	Definition
AFL	alpha-L-Fucose
AGC	alpha-D-Glucopyranose
BGC	beta-D-Glucopyranose
BOG	octyl-beta-D-Glucopyranose
FCA	alpha-D-Fucose
FCB	beta-D-Fucose
FMF	2-deoxy-2-fluoro alpha-D-Mannopyranose
FUC	Fucose
FUL	beta-L-Fucose
G4S	D-Galactose-4-sulphate
GAL	D-Galactopyranose
GLA	alpha-D-Galactopyranose
GLB	beta-D-Galactopyranose
GLC	D-Glucopyranose
GLS	beta-D-Glucopyranose spirohydantoin [also used for D-Galactopyranose-6-sulphate]
GSA	D-Galactose-4-sulphate
LAK	Allolactose [b-D-Galp-(1-6)-b-D-Glcp]
LAT	Lactose [b-D-Galp-(1-4)-b-D-Glcp]
MAF	2-deoxy-2-fluoro alpha-D-Mannopyranose
MAL	Maltose [a-D-Glcp-(1-4)-a-D-Glcp]
NAG	N-acetyl D-Glucosamine
NAN	5-N-acetyl alpha-D-Neuraminic Acid
NGA	N-acetyl D-Galactosamine
SIA	5-N-acetyl D-Neuraminic Acid (Sialic acid)
SLB	5-N-acetyl beta-D-Neuraminic Acid
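The translation table can be pictured as a small XML document that is read into two lookup structures, one keyed by PDB residue code and one keyed by systematic name. The sketch below is only an illustration: the element and attribute names are hypothetical, because the actual schema of the pdb-care table is not described here. The entries are taken from Table 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; only the residue codes and definitions come from Table 1.
TABLE_XML = """
<translation_table>
  <residue code="AGC" type="monosaccharide" name="alpha-D-Glucopyranose"/>
  <residue code="BGC" type="monosaccharide" name="beta-D-Glucopyranose"/>
  <residue code="GSA" type="monosaccharide" name="D-Galactose-4-sulphate"/>
  <residue code="G4S" type="monosaccharide" name="D-Galactose-4-sulphate"/>
  <residue code="LAT" type="oligosaccharide" name="b-D-Galp-(1-4)-b-D-Glcp"/>
  <residue code="BOG" type="combined" name="octyl-beta-D-Glucopyranose"/>
</translation_table>
"""

def load_table(xml_text):
    """Return (code -> (type, name), name -> [codes]) built from the XML text."""
    root = ET.fromstring(xml_text.strip())
    by_code, by_name = {}, {}
    for entry in root.iter("residue"):
        code, name = entry.get("code"), entry.get("name")
        by_code[code] = (entry.get("type"), name)
        by_name.setdefault(name, []).append(code)
    return by_code, by_name

by_code, by_name = load_table(TABLE_XML)
print(by_code["BOG"])
print(by_name["D-Galactose-4-sulphate"])  # two codes share this definition
```

Keeping the reverse map as a list of codes per name makes redundant residue names visible instead of silently overwriting one with the other.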

pdb-care is written in C. Interaction with the user is done through a web-interface, which is implemented in PHP. The pdb-care [10] service is hosted at the central spectroscopic department of the German Cancer Research Centre in Heidelberg, Germany.

Results and discussion

The pdb-care [10] web interface lets the user either analyse a file obtained directly from the PDB via its PDB ID, or provide a PDB file located on the local computer by upload or by copy and paste into the input window. Since carbohydrate structures are described as so-called hetero atoms within the HETATM records, all data assigned to the ATOM records (amino acids, nucleotides) are ignored.
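The restriction to HETATM records can be illustrated with a short Python sketch that reads the fixed-width columns of the PDB coordinate format and skips every other record type. This is not the pdb-care implementation (which is written in C); the function names are hypothetical, and `make_record` exists only to build correctly aligned sample input.

```python
def parse_hetatm(pdb_text):
    """Collect hetero atoms; ATOM and all other record types are skipped."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        atoms.append({
            "serial": int(line[6:11]),
            "name": line[12:16].strip(),
            "res_name": line[17:20].strip(),
            "chain": line[21],
            "res_seq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            "element": line[76:78].strip(),
        })
    return atoms

def make_record(record, serial, name, res_name, chain, res_seq, x, y, z, element):
    """Format one fixed-width coordinate line (only used to build sample input)."""
    fields = [
        f"{record:<6}",                # columns 1-6: record name
        f"{serial:>5}",                # 7-11: atom serial number
        " ",                           # 12
        f"{name:<4}",                  # 13-16: atom name
        " ",                           # 17: alternate location indicator
        f"{res_name:>3}",              # 18-20: residue name
        " ",                           # 21
        f"{chain:1}",                  # 22: chain identifier
        f"{res_seq:>4}",               # 23-26: residue sequence number
        " " * 4,                       # 27-30
        f"{x:8.3f}{y:8.3f}{z:8.3f}",   # 31-54: coordinates
        f"{1.0:6.2f}{0.0:6.2f}",       # 55-66: occupancy, temperature factor
        " " * 10,                      # 67-76
        f"{element:>2}",               # 77-78: element symbol
    ]
    return "".join(fields)

# Made-up coordinates, for illustration only.
sample = "\n".join([
    make_record("ATOM", 1, "CA", "ASN", "A", 10, 0.0, 0.0, 0.0, "C"),
    make_record("HETATM", 2, "C1", "NAG", "A", 401, 1.5, -2.25, 3.0, "C"),
])
for atom in parse_hetatm(sample):
    print(atom["res_name"], atom["res_seq"], atom["name"], atom["xyz"])
```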

Connectivity check

The examination of the information given in the CONECT records of the PDB files comprises two types of checks, which can be activated separately by the user. The first check analyses the reported connections. If a bond length differs from the normal length by more than the user-definable tolerance, or if the number of connections for an atom exceeds the maximum allowed for the respective element, a warning is displayed (Fig. 2a and 2b).
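The first check can be sketched in Python as follows. The reference bond lengths, the valence limits and the default tolerance are illustrative assumptions, not the values used by pdb-care, and the bonds are assumed to have been extracted from the CONECT records beforehand.

```python
import math

# Illustrative reference values in angstroms (assumptions, not pdb-care's tables).
TYPICAL_BOND = {("C", "C"): 1.53, ("C", "N"): 1.47, ("C", "O"): 1.43}
MAX_BONDS = {"H": 1, "O": 2, "N": 4, "C": 4}

def check_reported(atoms, bonds, tolerance=0.2):
    """atoms: {serial: (element, (x, y, z))}; bonds: iterable of (serial, serial)."""
    warnings = []
    counts = {}
    for a, b in bonds:
        (el_a, xyz_a), (el_b, xyz_b) = atoms[a], atoms[b]
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
        # Element pairs are looked up in alphabetical order, e.g. ("C", "O").
        expected = TYPICAL_BOND.get(tuple(sorted((el_a, el_b))))
        length = math.dist(xyz_a, xyz_b)
        if expected is not None and abs(length - expected) > tolerance:
            warnings.append(
                f"bond {a}-{b}: {length:.2f} A, expected about {expected:.2f} A")
    for serial, n in sorted(counts.items()):
        limit = MAX_BONDS.get(atoms[serial][0])
        if limit is not None and n > limit:
            warnings.append(f"atom {serial}: {n} connections, at most {limit} allowed")
    return warnings

# Made-up coordinates: a plausible C-O bond and one far too long C-C "bond".
atoms = {1: ("C", (0.0, 0.0, 0.0)), 2: ("O", (1.43, 0.0, 0.0)), 3: ("C", (10.0, 0.0, 0.0))}
print(check_reported(atoms, [(1, 2), (1, 3)]))
```

Element pairs without a reference length are not flagged in this sketch; a real check would need a complete table.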

Figure 2

Erroneous connections in PDB entries. Besides missing connection information, some entries contain surplus connections. (a) When the wrongly connected atoms are far apart, these errors can be spotted at first glance (PDB entry 1qoo, residue NAG401A). In this example, the spuriously assigned connections result in a hexavalent carbon atom. (b) Surplus connections spanning short distances are much more difficult to discover by visual inspection (PDB entry 1bcs, residues NAG1051, NAG1052).

The second check analyses the reported data for completeness. For each hetero atom, its distances to all other atoms are determined. If a distance is within the usual bond length range and the connection is not listed in the CONECT records, a warning is displayed. In some PDB files there are overlapping residues, e.g. because the crystal was soaked with both the alpha- and the beta-anomer of a ligand, like the residues FCA307 and FCB308 of PDB entry 1abf. In this case, the determination of bonds by atom distances may generate false-positive warnings. This may also happen when atom pairs in the structure lie close together but are not covalently bound. Therefore, bond length inconsistencies are reported as warnings and not as errors. It is up to the user to decide whether the reported problems are in fact due to incorrect information within the CONECT records. Since the coordination of ions depends on various electronic factors, pdb-care can be configured so that all bonds to ions are ignored. Both connectivity checks analyse all data given in the HETATM records and are therefore not limited to carbohydrate residues.
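A minimal sketch of the completeness check, under stated assumptions: the distance window is a hypothetical range for covalent bonds between light atoms, and the optional `ignore` set mimics the option to disregard bonds to ions. For brevity the sketch compares only the atoms passed in, not all atoms of the entry.

```python
import math
from itertools import combinations

# Assumed distance window in angstroms for a covalent bond between C, N and O atoms.
BOND_RANGE = (1.1, 1.7)

def missing_connections(atoms, reported, bond_range=BOND_RANGE, ignore=frozenset()):
    """atoms: {serial: (element, xyz)}; reported: set of frozenset({a, b}) bonds.

    Returns (serial_a, serial_b, distance) for atom pairs that lie within bonding
    distance but are not reported as connected. Pairs involving an element listed
    in `ignore` (e.g. ions) are skipped.
    """
    low, high = bond_range
    warnings = []
    for a, b in combinations(sorted(atoms), 2):
        if atoms[a][0] in ignore or atoms[b][0] in ignore:
            continue
        distance = math.dist(atoms[a][1], atoms[b][1])
        if low <= distance <= high and frozenset((a, b)) not in reported:
            warnings.append((a, b, round(distance, 2)))
    return warnings

# Made-up coordinates: atoms 2 and 3 are at bonding distance but not reported.
atoms = {1: ("C", (0.0, 0.0, 0.0)), 2: ("O", (1.4, 0.0, 0.0)), 3: ("C", (2.8, 0.0, 0.0))}
print(missing_connections(atoms, {frozenset((1, 2))}))
```

As in the text, such hits are only candidates: overlapping alternative residues or close non-bonded contacts produce the same signal.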

Nomenclature check

Mismatches between residue nomenclature and determined structure are the most common type of error found within carbohydrate residues in the PDB [8]. This type of error probably results from the fact that the number of available carbohydrate residues by far exceeds the number of amino acids or nucleotides. Additionally, the individual residues often differ from each other only in the orientation of hydroxyl groups attached to ring carbons (Fig. 1b). These small structural differences make it difficult to assign the stereochemistry of sugar units correctly. pdb-care generates carbohydrate nomenclature based only on the given atom types and their 3D atom coordinates using the pdb2linucs algorithm [8]. Subsequently, the two independently derived assignments are compared by looking up the translation table, in which systematic names and the respective PDB residue codes are listed. If they do not coincide, an error message is reported. The correct PDB residue code corresponding to the detected monosaccharide structure is provided on request. This option facilitates subsequent corrections of assignments. Residue nomenclature in PDB files is complicated by a large number of ambiguities and redundancies. On the one hand, there are many residues that stand for both the alpha- and the beta-anomer of one monosaccharide type, e.g. GAL for D-Galactose or GLC for D-Glucose. In an older version of the PDB HET Group Dictionary (March 2002), GLC was defined as 'alpha-D-Glucose', but since no residue name was defined for the beta-anomer, it was often used for beta-D-Glucose residues as well. In the current HET Group Dictionary, there are two further residues specifying the two Glucose anomers: AGC for α-D-Glcp and BGC for β-D-Glcp. Some residue names are even used for two entirely different residues. GLS, for example, is defined as 'beta-D-Glucopyranose spirohydantoin', but in entry 1kes it was defined as 'D-Galactose-6-sulphate'.
On the other hand, there are several residues for which more than one PDB residue name exists. 'D-Galactose-4-sulphate', for example, is encoded by both GSA and G4S, and both MAF and FMF encode '2-deoxy-2-fluoro-alpha-D-Mannose'. To reduce the number of ambiguously defined residues, pdb-care offers an option to suggest unique residue names, provided they are available, when ambiguous ones are reported in the PDB entry. As already discussed for the interpretation of conflicting connectivity information, it is again up to the experimentalist to judge whether inconsistencies in nomenclature are caused merely by the selection of a wrong PDB residue name or are due to an erroneous interpretation of the electron density maps. This decision can hardly be made automatically by software. The experimentalist has to decide whether, for example, a residue that is detected as 'D-GlcpNAc' (PDB residue code NAG) but is named NGA (defined as 'D-GalpNAc') is in fact 'D-GlcpNAc' and was just wrongly named, or whether it really is an NGA as stated by the PDB residue code, which would mean that there is a problem with the structure.
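The comparison of reported and detected assignments can be sketched as a table lookup. In the following Python sketch the detected name is assumed to come from a structure-based assignment such as pdb2linucs, which is not reimplemented here. The short names, the `consistent` rule for definitions that omit the anomer, and the mapping of outcomes to message levels are simplifying assumptions.

```python
# Excerpt of residue definitions in short notation; an anomer prefix ("a-", "b-")
# is present only where the definition specifies it.
PDB_DEFINITIONS = {
    "NAG": "D-GlcpNAc",
    "NGA": "D-GalpNAc",
    "GLC": "D-Glcp",
    "AGC": "a-D-Glcp",
    "BGC": "b-D-Glcp",
}

def consistent(defined, detected):
    """True if the definition equals the detected name or only omits the anomer."""
    return detected == defined or detected in ("a-" + defined, "b-" + defined)

def check_residue(reported_code, detected):
    """Return (level, message, suggested_codes) for one residue."""
    defined = PDB_DEFINITIONS.get(reported_code)
    if defined is None:
        return ("warning", f"{reported_code}: no entry in translation table", [])
    candidates = [code for code, name in PDB_DEFINITIONS.items()
                  if consistent(name, detected)]
    if consistent(defined, detected):
        # Suggest codes that state the detected anomer explicitly, if any exist.
        unique = [code for code in candidates
                  if PDB_DEFINITIONS[code] == detected and code != reported_code]
        return ("info", f"{reported_code} agrees with {detected}", unique)
    return ("error",
            f"{reported_code} ({defined}) conflicts with detected {detected}",
            candidates)

print(check_residue("NGA", "b-D-GlcpNAc"))
print(check_residue("GLC", "b-D-Glcp"))
```

The first call reproduces the NGA/NAG case discussed above and suggests NAG; the second shows an ambiguous code for which the unique name BGC is available. Whether the name or the structure is wrong remains a decision for the experimentalist.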

Linkage information

In most PDB files, information on how the single monosaccharides are linked to each other, to the proteins or to further non-carbohydrate residues is either missing entirely or present in a non-standardised form within the REMARK sections. Therefore, pdb-care verifies linkages only for PDB residues encoding oligosaccharides, where linkage data is given implicitly within the residue definition. However, the pdb2linucs algorithm is able to generate a structural description of the entire glycan, which can easily be transformed to the IUPAC nomenclature (Fig. 1c,1d).
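For oligosaccharide residues, the implicit linkage data can be read directly from the residue definition. The sketch below parses the unbranched IUPAC-like strings listed in Table 1 and compares the defined linkages with detected ones. The function names are hypothetical, and branched structures, which need a richer notation such as LINUCS, are not handled.

```python
import re

# Oligosaccharide definitions as listed in Table 1.
OLIGOSACCHARIDES = {
    "LAK": "b-D-Galp-(1-6)-b-D-Glcp",
    "LAT": "b-D-Galp-(1-4)-b-D-Glcp",
    "MAL": "a-D-Glcp-(1-4)-a-D-Glcp",
}

# A linkage such as "-(1-4)-"; the two groups are the linked ring positions.
LINKAGE = re.compile(r"-\((\d)-(\d)\)-")

def parse_linear(definition):
    """Split an unbranched IUPAC-like string into residues and (from, to) linkages."""
    # re.split keeps the captured groups, so the parts alternate as
    # residue, from, to, residue, from, to, ..., residue.
    parts = LINKAGE.split(definition)
    residues = parts[0::3]
    linkages = [(int(a), int(b)) for a, b in zip(parts[1::3], parts[2::3])]
    return residues, linkages

def linkage_matches(code, detected_linkages):
    """Compare the linkages implied by a residue code with detected ones."""
    _, defined = parse_linear(OLIGOSACCHARIDES[code])
    return defined == list(detected_linkages)

print(parse_linear(OLIGOSACCHARIDES["LAT"]))
print(linkage_matches("LAT", [(1, 6)]))  # a detected 1-6 linkage fits LAK, not LAT
```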

Conclusions

The currently available check software for molecular 3D structures in the PDB, such as WhatCheck [13] or ProCheck [14,15], is focussed on the protein part of PDB files. Intensive use of these programs has led to a high quality of the annotation of protein structures deposited in the PDB. The lack of corresponding software for carbohydrate residues results in a high rate of assignment errors for this part of the PDB information. It can be anticipated that frequent use of pdb-care will improve the quality of carbohydrate data contained in the PDB. This enhancement of the glyco-related information will make it more reliable for the realm of glycobiology. The automatic assignment of carbohydrate structures contained in PDB entries will improve the cross-linking of glycobiology resources with genomic and proteomic data collections, which will be an important issue for the upcoming glycomics projects. Due to the current high rate of errors in the carbohydrate parts of PDB structures, however, it is not possible to extract these data from PDB files and include them in carbohydrate databases without any quality control. The existence of a check program makes it feasible to update GlycosciencesDB (the former SweetDB [16]) automatically with data obtained from the PDB whenever no inconsistencies are detected. For entries that are not included automatically, the software helps the database administrator to judge whether a structure should be accepted or not. To further improve the quality as well as the accessibility of glyco-related data contained in PDB entries, a complete structural description of a glycan, including stereochemistry as well as linkage information, should be reported. With pdb-care and pdb2linucs, two software tools are now available which produce such a description automatically from the data already contained in PDB entries.

Availability and requirements

• Project name: PDB CArbohydrate REsidue check (pdb-care)
• Project home page:
• Operating systems: Platform independent
• Programming language: C, PHP
• Other requirements: none
• Any restrictions to use by non-academics: none

Abbreviations

PDB: Protein Data Bank
XML: extensible markup language
PHP: PHP hypertext preprocessor

Authors' contributions

TL was responsible for the software design and implementation. CWvdL contributed to aspects of design and oversaw and managed contributions to the project. Both authors contributed equally to drafting the manuscript and read and approved the final manuscript.

References (11 in total)

Review 1. On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database.

Authors: R Apweiler; H Hermjakob; N Sharon
Journal: Biochim Biophys Acta Date: 1999-12-06

2. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

3. Data mining the protein data bank: automatic detection and assignment of carbohydrate structures.

Authors: Thomas Lütteke; Martin Frank; Claus-W von der Lieth
Journal: Carbohydr Res Date: 2004-04-02 Impact factor: 2.104

4. Stereochemical quality of protein structure coordinates.

Authors: A L Morris; M W MacArthur; E G Hutchinson; J M Thornton
Journal: Proteins Date: 1992-04

5. SWEET-DB: an attempt to create annotated data collections for carbohydrates.

Authors: Alexander Loss; Peter Bunsmann; Andreas Bohne; Annika Loss; Eberhard Schwarzer; Elke Lang; Claus-W von der Lieth
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

6. Errors in protein structures.

Authors: R W Hooft; G Vriend; C Sander; E E Abola
Journal: Nature Date: 1996-05-23 Impact factor: 49.962

Review 7. Glycobiology.

Authors: T W Rademacher; R B Parekh; R A Dwek
Journal: Annu Rev Biochem Date: 1988 Impact factor: 23.643

8. LINUCS: linear notation for unique description of carbohydrate sequences.

Authors: A Bohne-Lang; E Lang; T Förster; C W von der Lieth
Journal: Carbohydr Res Date: 2001-11-01 Impact factor: 2.104

9. Statistical analysis of the protein environment of N-glycosylation sites: implications for occupancy, structure, and folding.

Authors: Andrei-J Petrescu; Adina-L Milac; Stefana M Petrescu; Raymond A Dwek; Mark R Wormald
Journal: Glycobiology Date: 2003-09-26 Impact factor: 4.313

10. Biases and complex patterns in the residues flanking protein N-glycosylation sites.

Authors: Shifra Ben-Dor; Nir Esterman; Eitan Rubin; Nathan Sharon
Journal: Glycobiology Date: 2003-09-26 Impact factor: 4.313
