
NRG-CING: integrated validation reports of remediated experimental biomolecular NMR data and coordinates in wwPDB.

Jurgen F Doreleijers, Wim F Vranken, Christopher Schulte, John L Markley, Eldon L Ulrich, Gert Vriend, Geerten W Vuister.

Abstract

For many macromolecular NMR ensembles from the Protein Data Bank (PDB) the experiment-based restraint lists are available, while other experimental data, mainly chemical shift values, are often available from the BioMagResBank. The accuracy and precision of the coordinates in these macromolecular NMR ensembles can be improved by recalculation using the available experimental data and present-day software. Such efforts, however, generally fail on half of all NMR ensembles due to the syntactic and semantic heterogeneity of the underlying data and the wide variety of formats used for their deposition. We have combined the remediated restraint information from our NMR Restraints Grid (NRG) database with available chemical shifts from the BioMagResBank and the Common Interface for NMR structure Generation (CING) structure validation reports into the weekly updated NRG-CING database (http://nmr.cmbi.ru.nl/NRG-CING). Eleven programs have been included in the NRG-CING production pipeline to arrive at validation reports that list for each entry the potential inconsistencies between the coordinates and the available experimental NMR data. The longitudinal validation of these data in a publicly available relational database yields a set of indicators that can be used to judge the quality of every macromolecular structure solved with NMR. The remediated NMR experimental data sets and validation reports are freely available online.


Year: 2011 PMID: 22139937 PMCID: PMC3245154 DOI: 10.1093/nar/gkr1134

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Experimentally determined biomacromolecular three-dimensional (3D) structures typically are deposited in the Worldwide Protein Data Bank (wwPDB) (1–3) as a requirement by most journals including NAR. As of September 2011, there were over 76 000 entries in the PDB (cf. Table 1) of which ∼9000 entries had been solved by NMR. The BioMagResBank (BMRB) (4) serves as a global repository of experimental NMR data, such as restraints, assigned chemical shifts and dynamic order parameters. Together, these repositories present a valuable resource for numerous research areas in the life sciences.

Table 1.

PDB entries

Set	Entries
PDB	76 003
Solution NMR	9042
NRG-CING	8915
Proteins	7967
Dimers	413
Complexes	1235
Ligands	384
Deposition
Before 1990	9
1990-2000	1920
After 2000	7113

Overview of subsets of PDB entries (23 September 2011).

A series of experiments has shown that many NMR structures can be improved if they are recalculated from the original experimental data using present-day software and refinement protocols (5–7), including the STAP database published in this ‘Database’ issue of Nucleic Acids Research. These efforts have revealed that the deposited experimental data were highly heterogeneous in format, completeness and quality. Recently, we performed a large-scale optimization of X-ray derived PDB entries (8), which showed that nearly three quarters of these could be improved in terms of fit with the experimental data and geometric quality (9). The massive scale of this effort also allowed the analysis of even the smallest improvements in a statistically meaningful way (10). Recalculation and proper validation (i.e. validation including the experimental data) both require that the underlying experimental data are syntactically and semantically correct. We have therefore worked for several years on this topic (11,12). In collaboration with the BMRB, we have completed the remediation of the NMR restraint data entries, which resulted in the NMR Restraints Grid (NRG) databases. We recently added the BMRB chemical shift (CS) data and these combined results have been subjected to our integrated NMR structure and experimental data validation analyses, to yield the new database described in this contribution. We have named this database NRG-CING. The database is freely available at http://nmr.cmbi.ru.nl/NRG-CING and it will be updated on a weekly basis. For the NRG-CING pipeline, we have extended the Common Interface for NMR structure Generation (CING; pronounced ‘king’) software package (G. Vuister, et al., CING; an integrated residue-based structure validation program suite, manuscript in preparation).
The pipeline first assembles a set of experimental and structural data and then produces a report that includes the results of eleven computer programs that were written by us or by others. The quality of the structure coordinates is currently determined mainly by WHAT_CHECK (12) and PROCHECK-NMR (13). The experimental restraints are tested for consistency and agreement with the structure by CING, Wattos (14), and PROCHECK-NMR/Aqua (13). In addition, the systematic analysis of NMR restraints allowed us to extract new patterns of recurring problems (15). Validation of CS values based on structural and sequence information by CING and the external programs VASCO (16), SHIFTX (17) and TALOS+ (18) is an integral part of the analyses. The NRG-CING database is a coherent, annotated and verified collection of experimental input data, the resulting structures and the analyses of their quality. NRG-CING will be the basis for recalculation efforts such as the STAP (http://psb.kobic.re.kr/stap/refinement) and LOGRECOORD (7) databases that will lead to better quality NMR structure ensembles, which in turn will allow researchers in the life sciences, in drug design and in bioinformatics to better perform their structure-based research.

DATA PREPARATION

Data conversion

The creation of a coherent and validated database of both structures and experimental data requires several steps. For the NRG-CING production pipeline we employed four stages, which we call C, R, S and F, denoting coordinate, restraint, chemical shift and filtering, respectively (Figure 1).

Figure 1.

Flow chart. Data flow chart showing the software tools involved in this project: CING, Wattos and FormatConverter (FC). The four stages, denoted C, R, S and F, are described in the text. The dashed line indicates an alternative to the default route including all data types. The repositories, programs and data-formats are represented by cylinders, ‘closed rectangles’ and ‘open rectangles’, respectively.

Coordinate stage

The coordinate data flow in from the wwPDB using an mmCIF formatted file that adheres to the PDB eXchange dictionary (pdbx).

Restraints stage

When restraints are present, the coordinates and the restraints are imported directly from the NRG Database Of Converted Restraints [DOCR; (11)] at BMRB as a CCPN XML file.

Shift stage

We developed code in collaboration with BMRB to run through a wide variety of data sources in order to match older entries for which the match relation between BMRB and PDB entries had not yet been archived. The matching algorithms are documented for the NRG part at: http://tinyurl.com/68dd9l9 and the CING part at http://tinyurl.com/67vfuyl. The CS data from BMRB are then merged by the FormatConverter (FC) (19) in a procedure similar to the one used for the restraints (15).

Filter stage

The distance restraints (DR) are stereospecifically checked and in some cases corrected by FC and CING using the same method as currently in use at the BMRB (11). Distance restraints with violations over 2 Å (up to a maximum of three per entry) were omitted from the NRG-CING database and are labelled as outliers. Although such DRs are sometimes correct, the impact of removing correct DRs is deemed to be less detrimental compared to the effects of retaining potentially incorrect ones. In particular, the latter situation could result in unjustified labelling of an entry to be in discord with its experimental data. From anecdotal interactions with depositors we know that these restraints are often errant violations that were not observed at the time of structure calculation, but arose later as a consequence of correcting other problems, for example, typographical errors that led to a restraint being accidentally uncommented or incorrect mapping of one or two atom names. The referencing of the CS is validated during this stage by VASCO, which compares the CS values for the atoms in a protein to their statistical distribution in relation to the coordinate-derived per-atom solvent exposure (16).
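The outlier rule described above can be sketched as follows. This is a minimal illustration and not the CING or FC implementation: the function name `split_outliers` and the tie-breaking choice (dropping the largest violations first when more than three restraints exceed 2 Å) are assumptions, since the text only states the threshold and the per-entry maximum.

```python
from typing import List, Tuple

MAX_VIOLATION = 2.0  # Å, threshold quoted in the text
MAX_OUTLIERS = 3     # at most three restraints are omitted per entry


def split_outliers(violations: List[float]) -> Tuple[List[int], List[int]]:
    """Return (kept, outliers) as lists of restraint indices for one entry."""
    # Candidates are restraints whose violation exceeds the threshold.
    candidates = [i for i, v in enumerate(violations) if v > MAX_VIOLATION]
    # Assumption: if more than three qualify, drop the largest violations first.
    candidates.sort(key=lambda i: violations[i], reverse=True)
    outliers = sorted(candidates[:MAX_OUTLIERS])
    outlier_set = set(outliers)
    kept = [i for i in range(len(violations)) if i not in outlier_set]
    return kept, outliers


kept, outliers = split_outliers([0.1, 2.5, 3.0, 0.0, 4.2, 2.1])
print(kept, outliers)
```

In this made-up example four restraints exceed 2 Å, so only the three largest are labelled as outliers and the fourth (2.1 Å) is retained.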

Cloud computing

The CING calculations require on average 20 min per entry for a total of 3000 core hours to process the current set of entries. Most of that time is used to run the many external programs and to prepare the large number of plots that report on the data. Because the complete database needs to be reassembled following each major overhaul of the analysis, this project continues to require substantial computing power. As CING has many external program dependencies, it cannot easily be installed on a traditional grid, but we have found it to be very suitable for a cloud computing setup. The eleven programs required for generating a CING report besides CING (G. Vuister et al., manuscript in preparation) are: CCPN (19), DSSP (20), MatPlotLib (http://matplotlib.sourceforge.net), MOLMOL (21), PROCHECK/Aqua (13), Povray (http://www.povray.org), ShiftX (22), TALOS+ (18), VASCO (16), Wattos (14) and WHAT_CHECK (12). We use the cloud facilities at SARA, our industrial partner Bitbrains (Amstelveen, NL, USA) and WeNMR/INFN for each full iteration in the NRG-CING project.
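As a quick consistency check on the quoted compute budget, the figure of about 3000 core hours follows from the average run time and the number of NRG-CING entries in Table 1. The short calculation below assumes one core per entry, which the text does not state explicitly.

```python
ENTRIES = 8915          # NRG-CING entries (Table 1)
MINUTES_PER_ENTRY = 20  # average CING run time quoted in the text

# Assumption: each entry occupies a single core for its whole run.
core_hours = ENTRIES * MINUTES_PER_ENTRY / 60
print(f"{core_hours:.0f} core hours")  # prints "2972 core hours"
```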

Project management

A large international collaborative project like NRG-CING requires the identification and remediation of issues with software developed and procedures used. From the beginning of this project in 2008, the issues were maintained in a Google Code repository at http://code.google.com/p/cing and linked to the source code in the CING project. Together with the general CING issues, almost all of the 300+ issues currently listed have been addressed. The documentation is described in Wiki pages at the same site. An automatic build and test farm for several operating systems is managed by Jenkins Continuous Integration (CI, http://jenkins-ci.org) at http://nmr.cmbi.ru.nl/jenkins/job/CING.

RESULTS

NRG-CING database overall composition

Of the 8915 entries contained in the NRG-CING database (September 2011) 5423 contained experimental data including DRs (Tables 1 and 2). These entries span the full time frame during which NMR structures have been deposited (1988 to present). Analysis of the experimental data variation also showed that the set contains structures determined both from ‘sparse data’, where only a limited amount of structural information was extracted from NMR experiments, and from abundant experimental data.

Table 2.

Statistics of the NRG-CING database

Set	Entries	Per entry count
		Average (SD)	Min.	Max.
Experimental restraints	5519	1392 (1158)	9	11 044
Distances (DRs)	5423	1325 (1107)	11	10 112
AIR DRs only^a	97	27 (14)	11	49
Dihedral angles^b	3401	128 (106)	9	1099
RDCs	426	139 (148)	9	970
Chemical shifts	3626	780 (512)	2	3959
Number of residues	NA	92 (71)	2	1659

^a The number of HADDOCK AIR entries was overestimated by including every NRG-CING entry with <50 DRs.
^b The number of entries with dihedral angle restraints is overestimated by including CS-derived ones from TALOS+.

NA: Not Applicable.


Examples of longitudinal validation

The CS values of the β and γ carbons of proline have been shown to be sensitive to the usual trans or the occasionally occurring cis peptide bond configurations. A study based on 33 cis and 1000 trans Pro residues in non-paramagnetic proteins showed a clear clustering for the 13C β/γ CS difference (CSD) values (23). The regions of (0.0, 4.8) and (9.15, 14.4) ppm corresponded with near absolute certainty to the trans and cis conformations, respectively. In NRG-CING we observe 228 cis and 7949 trans Pro in 3435 entries with β/γ carbon CS values obtained from BMRB. We have identified the reversed correspondence for 8 (cis) and over 100 (trans) occurrences. For example, in the recent Structural Genomics PDB entry 2k8s (Cort J.R. et al., unpublished results), Pro57 in chain B has a CSD of 11.9 ppm, which indicates a contradiction with the trans state modelled in all conformers of the ensemble. We also observed much more extreme CSD values that are likely caused by human error: e.g. the CSD of Pro71 in PDB entry 2i4k (24) has a very large value (37 ppm) that most likely resulted from uncorrected folding/aliasing of the NMR spectrum.

A second example of the combined analysis of chemical shifts in relation to structural quality concerns the sidechain conformation of the leucine delta carbons. Here too, chemical shifts have proven reliable indicators of conformation (25). For the NRG-CING database, 218 (trans) and 115 (gauche+) structured leucine residues in a total of 286 entries showed inconsistencies between observed chemical shifts and χ1/χ2 sidechain conformations that warrant further investigation (Berntsen, K.R.M., Doreleijers, J.F., Breukels, V., Stens, E., Vriend, G. and Vuister, G.W. manuscript in preparation).
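The proline check described above can be illustrated with a short sketch. It is not the CING code: the function names are hypothetical, the range endpoints are treated as inclusive by assumption, and the CSD is taken as a single precomputed number because the text does not spell out the sign convention of the β/γ difference.

```python
TRANS_RANGE = (0.0, 4.8)   # ppm, CSD region associated with trans Pro
CIS_RANGE = (9.15, 14.4)   # ppm, CSD region associated with cis Pro


def classify_pro_csd(csd_ppm: float) -> str:
    """Map a 13C beta/gamma chemical shift difference onto a peptide bond state."""
    if TRANS_RANGE[0] <= csd_ppm <= TRANS_RANGE[1]:
        return "trans"
    if CIS_RANGE[0] <= csd_ppm <= CIS_RANGE[1]:
        return "cis"
    return "undetermined"  # outside both regions, e.g. a folded/aliased shift


def check_proline(csd_ppm: float, modelled: str) -> str:
    """Compare the CSD-based state with the state modelled in the coordinates."""
    predicted = classify_pro_csd(csd_ppm)
    if predicted == "undetermined":
        return "outside both ranges"
    return "consistent" if predicted == modelled else "contradiction"


# Pro57 in chain B of 2k8s: CSD of 11.9 ppm, modelled as trans in all conformers.
print(check_proline(11.9, "trans"))  # prints "contradiction"
# Pro71 of 2i4k: CSD of 37 ppm falls outside both regions.
print(classify_pro_csd(37.0))        # prints "undetermined"
```

Values that fall between or beyond the two regions are reported separately instead of being forced into one state, which mirrors how the 37 ppm case is treated as a likely data error instead of a conformational assignment.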

AVAILABILITY

Reports

Currently all wwPDB members (RCSB-PDB, PDBe, PDBj and BMRB) include links to the NRG-CING reports. These pointers drive the vast majority of traffic to the NRG-CING database. The complete NRG-CING database can be accessed by any user. In addition to straightforward selection of specific PDB entries, the front page of the NRG-CING website also allows interactive selection using different criteria, such as protein size, number of distance restraints or chemical shift restraints or ROG score. According to Google Analytics, during the last year NRG-CING was on average visited each day ∼25 times by 9 ‘absolute unique visitors’.

Relational database

In addition to the web-based interactive HTML, CSV dumps from the relational database are available (http://nmr.cmbi.ru.nl/NRG-CING/pgsql). These files can be imported into a slave database using the SQL script at http://tinyurl.com/3rb24eq. The relational database (RDB) contains the validation data at the levels of entry, chain, residue and atom with special tables recently added for DRs and CSs. Many of the validation criteria in CING are also in this relational database, and plots are available at http://nmr.cmbi.ru.nl/NRG-CING/HTML/plot.html, showing the distribution of values such as detailed in Figure 2 for the CING ROG scores. The NRG-CING RDB is set up in conjunction with the PDBj Mine RDB for full cross-correlated access to PDB meta data such as deposition dates (26).

Figure 2.

ROG Results from NRG-CING. The percentage of residues with ROG score red (bad) versus green (good) is plotted with filled circles for 6265 NMR PDB entries from NRG. The red, orange, green (ROG) score is a composite assessment over individual program's validation criteria on the quality of entities such as restraint, coordinate, peak, chemical shift, atom, residue, molecule, etc. The ROG scores are propagated based upon defined relationships between such entities. The entries were selected to have at least: 3.5 kDa molecular mass, 10 models and one protein chain. On the bottom right of the banana-shaped distribution are a minority of entries that have a significant fraction of residues marked red. Note that the percentages green and red taken together with the omitted dimension for orange, add up to 100%.
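The per-entry quantities plotted in Figure 2 can be sketched as below. This is an illustrative reconstruction from the caption only: the function names and the representation of per-residue ROG scores as plain strings are assumptions, and the actual CING data model is not shown here.

```python
from collections import Counter
from typing import Dict, List


def rog_percentages(residue_scores: List[str]) -> Dict[str, float]:
    """Percentage of residues scored red, orange and green for one entry."""
    if not residue_scores:
        raise ValueError("entry has no scored residues")
    counts = Counter(residue_scores)
    total = len(residue_scores)
    return {c: 100.0 * counts[c] / total for c in ("red", "orange", "green")}


def passes_selection(mass_kda: float, n_models: int, n_protein_chains: int) -> bool:
    """Selection quoted in the caption: >= 3.5 kDa, >= 10 models, >= 1 protein chain."""
    return mass_kda >= 3.5 and n_models >= 10 and n_protein_chains >= 1


# Made-up entry with ten residues: six green, three orange, one red.
pct = rog_percentages(["green"] * 6 + ["orange"] * 3 + ["red"])
print(pct)  # {'red': 10.0, 'orange': 30.0, 'green': 60.0}
```

Because every residue carries exactly one of the three colours, the red and green percentages together with the omitted orange percentage add up to 100%, as the caption notes.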

iCing Server and service

Our multilingual web server (https://nmr.cmbi.ru.nl/icing/) and a web service together are called iCing (see Figure 3). It allows a user to submit NMR-derived coordinates, restraints and CS values in three data formats. We preferentially employ CCPN project files (19), but also accommodate additional data formats, such as the out-dated plain PDB format for structural data only. Although not preferred, this capability does provide the casual user access without sophistication. In collaboration with Dr Torsten Herrmann, we added the capability to upload CYANA formatted data, which will facilitate more standalone programs to integrate with the iCing service.

Figure 3.

iCing Web Server and Service. The screenshot of iCing (Spanish translation selected) shows the customizable definitions for ‘poor’ (orange) and ‘bad’ (red) that CING will use for some WHAT_CHECK parameters. The Google Web Toolkit (GWT) allowed us to easily add German, Spanish, French, Italian, Japanese, Dutch, Portuguese, Russian and Chinese translations to the default English language with help from our colleagues who are native speakers of these languages.

The iCing server can be used prior to a submission or, even better, as part of the iterative process of NMR structure determination. Figure 3 shows that the user can customize the validation criteria, which can be useful to specifically focus attention on particular aspects. Generally speaking, however, this is not recommended because the standard criteria are used in deriving the NRG-CING database. Validation of the validation criteria themselves is a topic of ongoing research. The server uses a simple three-tier setup with a Google Web Toolkit 2.0 front end, an Apache/Tomcat secured HTTP servlet, and a backend part including the CING installation. The iCing server has seen 1025 unique views during the first 10 months of 2011, according to Google Analytics. The standalone CCPN Analysis program (19) is using iCing as a service extensively. In total, the iCing service has been used for 1417 data sets in the same period.

FUTURE PERSPECTIVES

Improvements

Although already a valuable resource, as judged from its usage statistics, we continuously seek improvements to the database. We plan to address the following topics: (i) we aim to make the database 100% complete by solving a series of difficult data-related issues (such as Google Code NRG issue 272 and CING issues 266, 310–312) that currently limit us to include only 98.6% of the PDB entries. (ii) We plan on improving the NRG-CING setup with better matches between older BMRB and PDB entries, deposited before the relationship between these was maintained. (iii) Finally, although RDC data are contained within the database, these should be validated as well.

Usage

Finally, NRG-CING only contains the released PDB entries. This journal, Nucleic Acids Research, like many journals, encourages authors of new structure papers to provide referees with the output from PDB's validation report from http://deposit.pdb.org/validate. It would be of great value to authors and referees to have these CING reports available in addition to the currently used validation reports on the coordinates alone.

FUNDING

The Netherlands Organization for Scientific Research (NWO) (grant 700.55.443, to G.W.V.), Netherlands Bioinformatics Centre (NBIC); EU FP6 grants STREP Extend-NMR (LSHG-CT-2005-018988) and EMBRACE (LHSG-CT-2004-512092); FP7 WeNMR (grant 261572, to J.F.D., G.V. and G.W.V.); Brussels Institute for Research and Innovation (Innoviris) (grant BB2B 2010-1-12, to W.F.V.); US National Library of Medicine (grant LM05799, to C.S., E.L.U. and J.L.M.). Funding for open access charge: Radboud University Medical Centre Funds to the CMBI.

Conflict of interest statement. None declared.

26 in total

1. RefDB: a database of uniformly referenced protein chemical shifts.

Authors: Haiyan Zhang; Stephen Neal; David S Wishart
Journal: J Biomol NMR Date: 2003-03 Impact factor: 2.835

2. Announcing the worldwide Protein Data Bank.

Authors: Helen Berman; Kim Henrick; Haruki Nakamura
Journal: Nat Struct Biol Date: 2003-12

3. Leucine side-chain conformation and dynamics in proteins from 13C NMR chemical shifts.

Authors: Frans A A Mulder
Journal: Chembiochem Date: 2009-06-15 Impact factor: 3.164

4. MOLMOL: a program for display and analysis of macromolecular structures.

Authors: R Koradi; M Billeter; K Wüthrich
Journal: J Mol Graph Date: 1996-02

5. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.

Authors: W Kabsch; C Sander
Journal: Biopolymers Date: 1983-12 Impact factor: 2.505

6. Validation of archived chemical shifts through atomic coordinates.

Authors: Wolfgang Rieping; Wim F Vranken
Journal: Proteins Date: 2010-08-15

7. PDBj Mine: design and implementation of relational database interface for Protein Data Bank Japan.

Authors: Akira R Kinjo; Reiko Yamashita; Haruki Nakamura
Journal: Database (Oxford) Date: 2010-08-25 Impact factor: 3.451

8. Re-refinement from deposited X-ray data can deliver improved models for most PDB entries.

Authors: Robbie P Joosten; Thomas Womack; Gert Vriend; Gérard Bricogne
Journal: Acta Crystallogr D Biol Crystallogr Date: 2009-01-20

9. The NMR restraints grid at BMRB for 5,266 protein and nucleic acid PDB entries.

Authors: Jurgen F Doreleijers; Wim F Vranken; Christopher Schulte; Jundong Lin; Jonathan R Wedell; Christopher J Penkett; Geerten W Vuister; Gert Vriend; John L Markley; Eldon L Ulrich
Journal: J Biomol NMR Date: 2009-10-07 Impact factor: 2.835

10. Determinants of the endosomal localization of sorting nexin 1.

Authors: Qi Zhong; Martin J Watson; Cheri S Lazar; Andrea M Hounslow; Jonathan P Waltho; Gordon N Gill
Journal: Mol Biol Cell Date: 2005-01-26 Impact factor: 4.138
