Joanna Lange1,2, Coos Baakman2, Arthur Pistorius2, Elmar Krieger2, Rob Hooft3,4, Robbie P Joosten5, Gert Vriend2,6.
Abstract
We describe a series of databases and tools that directly or indirectly support biomedical research on macromolecules, with a focus on their applicability in protein structure bioinformatics research. DSSP, which determines secondary structures of proteins, has been updated to work well with extremely large structures in multiple formats. The PDBREPORT database, which lists anomalies in protein structures, has been remade to remove many small problems. These reports are now available as PDF-formatted files with a computer-readable summary. The VASE software has been added to analyze and visualize HSSP multiple sequence alignments for protein structures. The Lists collection of databases has been extended with a series of databases, most notably a database that gives each protein structure a grade for usefulness in protein structure bioinformatics projects. The PDB-REDO collection of reanalyzed and re-refined protein structures that were solved by X-ray crystallography has been improved by dealing better with sugar residues and with hydrogen bonds, and by adding many missing surface loops. All academic software underlying these protein structure bioinformatics applications and databases is now publicly accessible, either directly from the authors or from the GitHub software repository.
Keywords: DSSP; PDB; bioinformatics support; protein structure bioinformatics
Year: 2019 PMID: 31724231 PMCID: PMC6933850 DOI: 10.1002/pro.3788
Source DB: PubMed Journal: Protein Sci ISSN: 0961-8368 Impact factor: 6.725
Figure 1. Examples that explain why not all PDB entries are equally useful for PSB studies. (a) The long lines indicate very long bonds in 1I4C. These seem to be caused by incorrect formatting of the PDB entry. Coordinates are written with only two characters before the decimal point, which makes them lose the minus sign when a coordinate is below −9.999; for example, an X‐coordinate like −11.236 becomes 11.236. (b) The N‐terminus of the azurin structure, 1AG0, once consisted of two half alanines with a copper ion in the middle. This problem has recently been solved, but the PDB provides no easy mechanism for correction tracking. (c) This small polyglycine helix (1CEK) is actually the result of a complex biophysical experiment. There is nothing wrong with this structure, but we believe that it should not be used in PDB‐wide computational studies. (d) Something went very wrong when solving this protein structure (2PDE, subunit‐binding domain of dihydrolipoamide acetyltransferase): the very short helix, indicated by a short blue cylinder, works as a chaotic attractor through which the chain passes seven times.
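The sign loss described in panel (a) can be reproduced in a few lines. The sketch below is an illustration only: it assumes a writer that keeps exactly six characters per coordinate (two digits, the decimal point, and three decimals), and the helper name `write_fixed` is hypothetical rather than taken from any PDB tool.

```python
def write_fixed(value: float, width: int = 6) -> str:
    """Format a coordinate with three decimals and keep only the last
    `width` characters, mimicking a field that is too narrow.
    Hypothetical helper; the PDB format itself reserves eight characters."""
    text = f"{value:.3f}"
    return text[-width:].rjust(width)


# Values at and beyond the -9.999 limit mentioned in the caption.
for x in (-9.999, -11.236, 11.236):
    print(f"{x:>8.3f} -> '{write_fixed(x)}'")
```

With a six-character field, −9.999 still fits, whereas −11.236 and 11.236 become indistinguishable, which is the kind of corruption that would produce the very long bonds shown for 1I4C.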
Summary of the facilities mentioned in this article
| Facility | Short description |
|---|---|
| wwPDB | Worldwide PDB. Macromolecular data collection and distribution. |
| UniProt | Worldwide collection of protein sequences |
| Swiss‐Prot | Manually annotated and reviewed section of the UniProt |
| PDB‐REDO | Reanalyzed, consistently treated structure models |
| BDB | PDB entries with standardized, isotropic B factors |
| DSSP | Secondary structure assignment for protein structure models |
| DSSP_REDO (novel) | Like DSSP, but for PDB‐REDO entries |
| WHAT IF | Protein structure calculations. Outdated; some useful aspects as servers. |
| YASARA_View | Free (feature‐rich) molecular viewer that additionally contains all of WHAT IF |
| HSSP | Multiple sequence alignments against UniProt for proteins in PDB. |
| VASE | Visualization of entropy/variability values in HSSP entries |
| PDB‐Vis | Visualization of crystal packing contacts, and a few more things |
| LigPlot | 2D representation of ligand–macromolecule interactions |
| Lists | Sets of precalculated data for all PDB entries (see Table |
| PDB_REPORT | Reports (in a PDF format) about anomalies and errors in PDB entries |
| WHY_NOT | System that explains why certain data files are not present |
| pdbad | Anecdotal list of problems in PDB entries |
| PDBFINDER | Easy to parse PDB metadata |
| PDBsum | Summary of PDB metadata useful for PSB |
| LigPlot | Visualizes interactions between macromolecules and ligands |
| PDB_SELECT | Subsets of PDB entries (sequence unique at 30% cutoff) |
| PISCES | More extensive system for making sequence unique subsets of PDB entries |
| CATH, SCOP, DALI | Three facilities that shed light on the 3D relations between PDB entries |
| Swiss‐Model | Builds homology models |
| Sternberg | Examples of long‐time stable group pages with many useful facilities |
Notes: The facilities in blue are produced by the authors of this article; the facilities in black are produced and maintained by others and we believe them to be useful for users of our facilities. Most of our facilities can be downloaded from ftp://ftp.cmbi.umcn.nl/pub/molbio/data/ and the software from the cmbi section in GitHub. YASARA_View can be downloaded freely from http://www.yasara.org/. http://swift.cmbi.umcn.nl/gv/facilities/ provides extensive documentation for the databases and instructions for obtaining an in‐house copy via rsync. Facilities without reference have not been published explicitly yet. References to some facilities are best extracted from those facilities' web pages. The facilities that are published here for the first time are labeled with the word “novel”. Facilities that underwent major updates since they were last published are labeled with the word “improved.”
http://www.uniprot.org/.
https://web.expasy.org/docs/swiss-prot_guideline.html.
http://www.yasara.org/viewdl.htm.
http://www.cmbi.umcn.nl/vase/.
http://www.cmbi.umcn.nl/pdb-vis/.
http://www.ebi.ac.uk/thornton-srv/software/LigPlus/.
http://www.cmbi.umcn.nl/why_not2/.
http://swift.cmbi.umcn.nl/teach/pdbad/.
http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=index.html.
https://www.ebi.ac.uk/thornton-srv/software/LIGPLOT/.
http://dunbrack.fccc.edu/PISCES.php.
https://www.cathdb.info/.
http://scop.mrc-lmb.cam.ac.uk/.
http://ekhidna2.biocenter.helsinki.fi/dali/.
https://swissmodel.expasy.org/.
http://www.sbg.bio.ic.ac.uk/.
http://www.cbs.dtu.dk/services/.
Figure 2. Protein structure quality improvements obtained by PDB‐REDO. (a) Percentage of files for which the PDB‐REDO entry showed “better” characteristics in green, PDB better in red, and equally good in yellow. (b) How often the PDB‐REDO entry was better/worse than the PDB entry for 4 to 7 of the categories used in (a). These analyses were run on all entries available in August 2019. We manually checked many of the 1,195 cases for which all seven categories showed worse statistics for the PDB‐REDO entry and found many examples of structures solved at extremely low resolution (6.0 Å or worse). We also observed that the cases in which all model quality indicators deteriorated were skewed toward more recently made PDB‐REDO entries. This was (at least partially) traced back to bugs in third‐party programs, which PDB‐REDO did not intercept correctly. These bugs are being fixed, and the affected PDB‐REDO entries will be replaced.
Figure 3. Examples of secondary structure assignments in PDB entries. (a) Ribbon drawing of http://firstglance.jmol.org/fg.htm?mol=1M2I (Cytochrome b5) colored by the PDB‐assigned secondary structure. Cyan: not assigned; blue: helix; red: sheet. The ball model is a heme group that is held by two histidine side chains (shown in cyan and blue). The agreement in secondary structure assignment between YASARA and DSSP is higher than that between PDB and DSSP. Additionally, one 3₁₀‐helix is called helix in the PDB entry, while another 3₁₀‐helix is not assigned. WHAT IF normally reduces the eight‐character DSSP alphabet to a four‐character one in which 3₁₀‐helix and π‐helix are combined with α‐helix into H, and nonhydrogen‐bonded strands are called loop. (b) Example of a nearly unsolvable problem. The glycine (Gly‐20) in http://firstglance.jmol.org/fg.htm?mol=1KC7 (pyruvate phosphate dikinase) that is located in the corner “between” the longer horizontal helix and the single‐turn helix in the upper right actually has an H‐bond in each of these two helices. This makes it look like one long helix in DSSP and in Lists entries. The only solution we see for this rare problem (glycine 107 in http://firstglance.jmol.org/fg.htm?mol=1L51, lysozyme, is another example) is manual annotation in PDB entries. Most programs, including DSSP and YASARA, have their own way of dealing with secondary structure assignments, often based on hydrogen bond assignments with cutoffs that are a bit arbitrary. It seems wise to always use the same arbitrariness. That is why we decided to make DSSP open source with a very permissive license so that DSSP can be incorporated in other programs.
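The helix merging mentioned in panel (a) can be sketched as follows. This is a minimal illustration, not WHAT IF's code: it implements only the rule stated in the caption (3₁₀‐helix and π‐helix are folded into H), using the standard DSSP letters H, G, and I for the three helix types. How WHAT IF maps the remaining DSSP states onto its four‐character alphabet is not specified here, so all other characters are passed through unchanged. The function name `merge_helices` and the example string are made up.

```python
# DSSP one-letter codes for the three helix types:
# H = alpha-helix, G = 3-10 helix, I = pi-helix.
HELIX_CODES = {"H", "G", "I"}


def merge_helices(dssp_string: str) -> str:
    """Replace every helix code by 'H'; all other characters pass through."""
    return "".join("H" if c in HELIX_CODES else c for c in dssp_string)


example = "  GGGTTHHHHHHIIIIIS EEEE "  # made-up DSSP string, not from a PDB entry
print(repr(merge_helices(example)))
```

Applying the same reduction to every assignment before comparing programs keeps the comparison independent of how each program labels short helices.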
Secondary structure assignments of PDB versus PDB‐REDO
| Length difference (residues) | REDO longer | REDO shorter |
|---|---|---|
| 1 | 1,500 | 1,298 |
| 2 | 660 | 268 |
| 3 | 303 | 310 |
| >3 | 230 | 87 |
| | | |
| 1 | 9,761 | 6,368 |
| 2 | 2,256 | 1,106 |
| 3 | 637 | 459 |
| >3 | 531 | 220 |
Notes: The DSSP assignments of 135K helices that were at least six residues long in the PDB‐REDO entry were compared with those in the corresponding PDB entry. We then asked how often the helix in PDB‐REDO was shorter or longer than in the corresponding PDB entry. In 1,500 cases, the helix in the PDB‐REDO entry was one residue longer than in the corresponding PDB entry. Re‐refinement by PDB‐REDO makes helices longer more often than shorter. Of the residues assigned differently near helix ends, 4% are assigned strand in one of the two files. 108K helices were equally long in the two corresponding entries.
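The counts in the table can be tallied to check them against the totals quoted in the notes. The sketch below simply sums the two blocks of the table; the labels `first` and `second` are placeholders, because the table does not name its blocks here, and treating the two blocks as non-overlapping is an assumption.

```python
# Counts copied from the table, in row order 1, 2, 3, >3.
# "first" and "second" are placeholder labels for the two blocks.
longer = {"first": [1500, 660, 303, 230], "second": [9761, 2256, 637, 531]}
shorter = {"first": [1298, 268, 310, 87], "second": [6368, 1106, 459, 220]}

total_longer = sum(sum(v) for v in longer.values())
total_shorter = sum(sum(v) for v in shorter.values())
changed = total_longer + total_shorter

print("REDO longer :", total_longer)
print("REDO shorter:", total_shorter)
print("changed     :", changed)
# 108K helices were reported as equally long in both entries.
print("changed + unchanged:", changed + 108_000)
```

Under that assumption, the changed and unchanged helices add up to roughly 134K, which is close to the quoted 135K given that both figures are rounded, and the "longer" total exceeds the "shorter" total, in line with the statement that re‐refinement more often lengthens helices.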
Figure 4. Homology modeling threshold curve. This plot describes which minimum percentage sequence identity in an alignment of a given length is an indication that the aligned proteins have similar structures. This is frequently used in the context of homology modeling: alignments above the curve indicate that it is possible to make a fairly reliable homology model from the aligned template; alignments below the curve mean that a homology model should be handled with care. We also use this plot for the inclusion of sequences in HSSP alignments.
Figure 5. Some variants of the word cacodylate found in PDB entries. The question marks indicate that those words were also found a few dozen times in the literature, at locations where cacodylate could be expected.
Figure 6. PDBFINDER entry for the PDB entry 1CRN (crambin). Most key value combinations are self‐explanatory. Indentation of a key indicates that it is a child of the unindented parent above it. The six bottom lines are extracted from the corresponding DSSP and HSSP entries. For the bottom four lines, the HSSP‐derived values are scaled to 0.0–9.0 and represented by the nearest integer to that scaled number. PDBFINDER2 entries additionally hold many lines of calculated per‐residue information, including average B factors, packing normality, geometric anomalies, side chain flips, and so on.
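The parent/child convention and the 0–9 scaling described for PDBFINDER entries can be sketched as below. This is a sketch under stated assumptions, not the PDBFINDER reader: it assumes each line has the form `Key : Value`, that any leading whitespace marks a child key, and that the value range used for scaling is known. The key names in `SAMPLE`, the function names, and the tie‐breaking rule are all hypothetical.

```python
def parse_entry(text: str) -> list:
    """Parse 'Key : Value' lines; indented keys become children of the
    closest unindented key above them (assumed layout)."""
    records = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or non key/value lines
        key, value = line.split(":", 1)
        node = {"key": key.strip(), "value": value.strip(), "children": []}
        if line[0].isspace() and records:
            records[-1]["children"].append(node)
        else:
            records.append(node)
    return records


def scale_to_digit(value: float, lo: float, hi: float) -> int:
    """Scale value from [lo, hi] to 0.0-9.0 and take the nearest integer.
    Ties round up here; the actual tie rule is an assumption."""
    scaled = 9.0 * (value - lo) / (hi - lo)
    return int(scaled + 0.5)


# Made-up fragment; key names are illustrative, not copied from a real entry.
SAMPLE = """\
ID           : 1CRN
Chain        : A
 Amino-Acids : 46
"""

for record in parse_entry(SAMPLE):
    print(record["key"], "=", record["value"],
          [c["key"] for c in record["children"]])
print(scale_to_digit(0.5, 0.0, 1.0))
```

Storing one digit per residue keeps the per‐residue lines the same length as the sequence line, which makes them easy to align by column position.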
Available Lists databases
| Database | Derived data type |
|---|---|
| chi | Torsion angles (φ, ψ, Ω, χ1‐5) |
| tau | Backbone angle τ |
| acc | Accessible molecular surface area |
| asa | Accessible surface area |
| dsp | Secondary structure overview |
| cc1 | Cα–Cα distance <12.5 Å |
| cc2 | Residue spheres closer than 0.25 Å |
| cc3 | Residue spheres closer than 2.5 Å |
| cc4 | Residues with an atom pair closer than 0.25 Å |
| cc5 | Residues with an atom pair closer than 2.5 Å |
| cc6 | Residues with a side chain atom pair closer than 0.25 Å |
| cc7 | Residues with a side chain atom pair closer than 2.5 Å |
| cc8 | Residues in different chains with a side chain atom pair closer than 2.5 Å |
| cc9 | Cβ–Cβ distance <12.5 Å |
| cli | Residues with an atomic contact to ligand <1.0 Å |
| cnu | Residue‐nucleic acid spheres closer than 1.0 Å |
| iod | Residue distance to the nearest (positive) ion |
| ion | Short residue–ion distances, grouped per ion |
| cys | Cysteine bridges |
| sbr | Salt bridges |
| sbh | Salt bridges assuming histidine is positive |
| qua | Coarse‐packing quality |
| nqa | Fine‐packing quality |
| flp | Backbone peptide‐plane flips |
| rot | Residue rotamer scores |
| sou | WHAT IF's interpretation of the PDB entry |
| sco | Gives each PDB entry a bioinformatics–usability score from 0.0 to 10.0 |
Notes: Each Lists database is extensively described at the Lists website. For example, angles in the “chi” and “tau” databases are in degrees between −180.0 and 180.0, accessibility values in “acc” and “asa” are in square Ångströms, and missing values are often set to −999.9. The Lists databases are grouped. Databases that provide elementary geometric parameters are listed in red. Amino acid contact databases intended for use in protein structure prediction PSB projects are in blue. Databases with other contacts are in green. Databases related to protein structure quality and normality are in yellow. The “sou” and “sco” databases, in purple, are special and are explained in the text.
Figure 7. First few lines of an example chi Lists database entry. From left to right, the columns are the sequential residue number; the residue type; the PDB residue number; the protein chain identifier; the secondary structure according to DSSP (in a reduced alphabet); φ, ψ, Ω, and χ1‐5. No calculations are done for residues that are not completely intact. The text “Residue is not intact” is used for this purpose throughout all Lists databases.
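A reader for such lines could look like the sketch below. It is an illustration under explicit assumptions: the columns are taken to be whitespace‐separated in the order given in the caption (the real files may use fixed‐width columns, and a blank chain identifier would break this split), and −999.9 is treated as the missing‐value marker mentioned for the Lists databases. The function name `parse_chi_line` and the sample lines are made up.

```python
from typing import Optional

MISSING = -999.9                      # missing-value marker used in Lists
NOT_INTACT = "Residue is not intact"  # text used for incomplete residues
ANGLE_NAMES = ["phi", "psi", "omega", "chi1", "chi2", "chi3", "chi4", "chi5"]


def parse_chi_line(line: str) -> Optional[dict]:
    """Parse one line of a chi Lists entry (assumed whitespace-separated).

    Returns None for residues that are not intact or for short lines."""
    if NOT_INTACT in line:
        return None
    fields = line.split()
    if len(fields) < 8:
        return None
    seq_no, res_type, pdb_no, chain, sec_struc = fields[:5]
    # Convert angles to float and map the missing-value marker to None.
    angles = [None if float(a) == MISSING else float(a) for a in fields[5:]]
    return {
        "seq_no": int(seq_no),
        "res_type": res_type,
        "pdb_no": pdb_no,
        "chain": chain,
        "sec_struc": sec_struc,
        "angles": dict(zip(ANGLE_NAMES, angles)),
    }


# Made-up lines in the assumed layout; not copied from a real entry.
good = "1 THR 1 A C -999.9 143.5 178.2 -60.1 -999.9 -999.9 -999.9 -999.9"
bad = "2 GLY 2 A C Residue is not intact"
print(parse_chi_line(good))
print(parse_chi_line(bad))
```

Mapping the marker to `None` rather than keeping −999.9 avoids accidentally averaging it into angle statistics.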
The “acc” Lists database has no data entry (and thus a .whynot entry) for more than 30K PDB entries
| Number of entries | .whynot comment |
|---|---|
| 11,027 | COMMENT: MODEL records found |
| 3,064 | COMMENT: Not an X‐ray structure |
| 102 | COMMENT: Not enough intact residues |
| 1,950 | COMMENT: Not enough residues |
| 438 | COMMENT: Percentage bad residues too high |
| 15,729 | COMMENT: Too many bad residues |
| 440 | COMMENT: Too many C‐alpha‐only residues |
| 825 | COMMENT: Too many residues |
Notes: About 11K of those are either structures solved by NMR or multimodel X‐ray files; more than 15K entries are missing because they hold too many amino acids with missing atoms (mainly side chain atoms of Glu, Arg, Lys, and Gln); 3K are missing because they were solved by a technique other than X‐ray crystallography (mainly EM structures, and NMR structures for which only one model is given rather than a multimodel ensemble). Other Lists databases have other entry types. For example, the Lists database for cysteine bridges has about 24K entries “COMMENT: Contains no cysteine bridges” and about 7K entries “COMMENT: Contains no cysteines”. These criteria are applied first for the “cys” database, and therefore the entry “COMMENT: Too many bad residues” occurs only a few thousand times in the “cys” database rather than almost 16K times as in the “acc” database.
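The counts in the table can be summed to confirm the "more than 30K" figure and to see which reasons dominate. The sketch below uses only the numbers from the table; the variable names are arbitrary.

```python
# Number of PDB entries without an "acc" Lists entry, per .whynot comment
# (values copied from the table).
whynot_counts = {
    "MODEL records found": 11027,
    "Not an X-ray structure": 3064,
    "Not enough intact residues": 102,
    "Not enough residues": 1950,
    "Percentage bad residues too high": 438,
    "Too many bad residues": 15729,
    "Too many C-alpha-only residues": 440,
    "Too many residues": 825,
}

total = sum(whynot_counts.values())

# Print the reasons from most to least frequent with their share of the total.
for reason, n in sorted(whynot_counts.items(), key=lambda kv: -kv[1]):
    print(f"{n:>7,}  {100 * n / total:5.1f}%  {reason}")
print(f"{total:>7,}  total")
```

The two largest categories, "Too many bad residues" and "MODEL records found", together account for about 80% of the missing entries.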
Figure 8. Examples of WHAT IF calculations visualized in YASARA_View. (a) Crystal structure of dihydrofolate reductase with inhibitor methotrexate (4DFR). Protein colored by HSSP conservation weights found in the PDBFINDER2 database from blue (not conserved) via red to yellow (totally conserved). Contacts between protein and ligand are shown: hydrogen bonds (yellow), hydrophobic contacts (green), and π‐interactions (red). (b) Crystal structure of the protein TolA domain III (3QDR). Strong clashes are reported by the PDBREPORT and PDBFINDER2 databases for residues Ile 378, Ile 367, and Phe 412 (indicated with gray arrows). Backbone‐dependent Ile rotamers from the WHAT IF DGROTA option are shown as thin sticks and colored from blue to yellow. Clashes can be resolved in this example by choosing more populated rotamers.