For many macromolecular NMR ensembles from the Protein Data Bank (PDB) the experiment-based restraint lists are available, while other experimental data, mainly chemical shift values, are often available from the BioMagResBank. The accuracy and precision of the coordinates in these macromolecular NMR ensembles can be improved by recalculation using the available experimental data and present-day software. Such efforts, however, generally fail on half of all NMR ensembles due to the syntactic and semantic heterogeneity of the underlying data and the wide variety of formats used for their deposition. We have combined the remediated restraint information from our NMR Restraints Grid (NRG) database with available chemical shifts from the BioMagResBank and the Common Interface for NMR structure Generation (CING) structure validation reports into the weekly updated NRG-CING database (http://nmr.cmbi.ru.nl/NRG-CING). Eleven programs have been included in the NRG-CING production pipeline to arrive at validation reports that list for each entry the potential inconsistencies between the coordinates and the available experimental NMR data. The longitudinal validation of these data in a publicly available relational database yields a set of indicators that can be used to judge the quality of every macromolecular structure solved with NMR. The remediated NMR experimental data sets and validation reports are freely available online.
Experimentally determined three-dimensional (3D) structures of biomacromolecules are typically deposited in the Worldwide Protein Data Bank (wwPDB) (1–3), as required by most journals including NAR. As of September 2011, there were over 76 000 entries in the PDB (cf. Table 1), of which ∼9000 had been solved by NMR. The BioMagResBank (BMRB) (4) serves as a global repository of experimental NMR data, such as restraints, assigned chemical shifts and dynamic order parameters. Together, these repositories present a valuable resource for numerous research areas in the life sciences.
Table 1. PDB entries

Set               Entries
PDB               76 003
Solution NMR      9042
NRG-CING          8915
Proteins          7967
Dimers            413
Complexes         1235
Ligands           384
Deposition
  Before 1990     9
  1990-2000       1920
  After 2000      7113

Overview of subsets of PDB entries (23 September 2011).
A series of studies, including the STAP database published in this ‘Database’ issue of Nucleic Acids Research, has shown that many NMR structures can be improved if they are recalculated from the original experimental data using present-day software and refinement protocols (5–7). These efforts have revealed that the deposited experimental data were highly heterogeneous in format, completeness and quality. Recently, we performed a large-scale optimization of X-ray derived PDB entries (8), which showed that nearly three quarters of these could be improved in terms of fit with the experimental data and geometric quality (9). The massive scale of this effort also allowed the analysis of even the smallest improvements in a statistically meaningful way (10).

Recalculation and proper validation (i.e. validation including the experimental data) both require that the underlying experimental data are syntactically and semantically correct. We have therefore worked for several years on this topic (11,12). In collaboration with the BMRB, we have completed the remediation of the NMR restraint data entries, which resulted in the NMR Restraints Grid (NRG) databases. We recently added the BMRB chemical shift (CS) data, and these combined results have been subjected to our integrated NMR structure and experimental data validation analyses to yield the new database described in this contribution. We have named this database NRG-CING. The database is freely available at http://nmr.cmbi.ru.nl/NRG-CING and it will be updated on a weekly basis.

For the NRG-CING pipeline, we have extended the Common Interface for NMR structure Generation (CING; pronounced ‘king’) software package (G. Vuister, et al., CING; an integrated residue-based structure validation program suite, manuscript in preparation). The pipeline first assembles a set of experimental and structural data and then produces a report that includes the results of eleven computer programs that were written by us or by others. The quality of the structure coordinates is currently determined mainly by WHAT_CHECK (12) and PROCHECK-NMR (13). The experimental restraints are tested for consistency and agreement with the structure by CING, Wattos (14) and PROCHECK-NMR/Aqua (13). In addition, the systematic analysis of NMR restraints allowed us to extract new patterns of recurring problems (15). Validation of CS values based on structural and sequence information by CING and the external programs VASCO (16), SHIFTX (17) and TALOS+ (18) is an integral part of the analyses.

The NRG-CING database is a coherent, annotated and verified collection of experimental input data, the resulting structures and the analyses of their quality. NRG-CING will be the basis for recalculation efforts such as the STAP (http://psb.kobic.re.kr/stap/refinement) and LOGRECOORD (7) databases. These efforts will lead to better quality NMR structure ensembles, which in turn will allow researchers in the life sciences, in drug design and in bioinformatics to better perform their structure-based research.
DATA PREPARATION
Data conversion
The creation of a coherent and validated database of both structures and experimental data requires several steps. For the NRG-CING production pipeline we employed four stages, which we call C, R, S and F, denoting coordinate, restraint, chemical shift and filtering, respectively (Figure 1).
Figure 1.
Flow chart. Data flow chart showing the software tools involved in this project: CING, Wattos and FormatConverter (FC). The four stages denoted: C, R, S and F are described in the text. The dashed line indicates an alternative to the default route including all data types. The repositories, programs and data-formats are represented by cylinders, ‘closed rectangles’ and ‘open rectangles’, respectively.
Coordinate stage
The coordinate data flow in from the wwPDB using an mmCIF formatted file that adheres to the PDB eXchange dictionary (pdbx).
Restraints stage
When restraints are present, the coordinates and the restraints are imported directly from the NRG Database Of Converted Restraints [DOCR; (11)] at BMRB as a CCPN XML file.
Shift stage
We developed code in collaboration with BMRB to run through a wide variety of data sources in order to match older entries for which the match relation between BMRB and PDB entries had not yet been archived. The matching algorithms are documented for the NRG part at: http://tinyurl.com/68dd9l9 and the CING part at http://tinyurl.com/67vfuyl. The CS data from BMRB are then merged by the FormatConverter (FC) (19) in a procedure similar to the one used for the restraints (15).
Filter stage
The distance restraints (DR) are stereospecifically checked and in some cases corrected by FC and CING using the same method as currently in use at the BMRB (11). Distance restraints with violations over 2 Å (up to a maximum of three per entry) were omitted from the NRG-CING database and are labelled as outliers. Although such DRs are sometimes correct, the impact of removing correct DRs is deemed to be less detrimental compared to the effects of retaining potentially incorrect ones. In particular, the latter situation could result in unjustified labelling of an entry to be in discord with its experimental data. From anecdotal interactions with depositors we know that these restraints are often errant violations that were not observed at the time of structure calculation, but arose later as a consequence of correcting other problems, for example, typographical errors that led to a restraint being accidentally uncommented or incorrect mapping of one or two atom names. The referencing of the CS is validated during this stage by VASCO, which compares the CS values for the atoms in a protein to their statistical distribution in relation to the coordinate-derived per-atom solvent exposure (16).
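The outlier rule described above can be illustrated with a minimal Python sketch. It is not the FC or CING implementation: the class `DistanceRestraint`, the function `split_outliers` and the field `max_violation` are hypothetical names introduced for illustration. The text does not state which restraints are dropped when more than three exceed the 2 Å cutoff; the sketch assumes the three largest violations are removed.

```python
from dataclasses import dataclass

VIOLATION_CUTOFF = 2.0  # Angstrom, cutoff quoted in the text
MAX_OUTLIERS = 3        # maximum number of restraints omitted per entry


@dataclass
class DistanceRestraint:
    """Hypothetical container for one distance restraint."""
    restraint_id: int
    max_violation: float  # assumed: largest violation in the ensemble (Angstrom)


def split_outliers(restraints):
    """Return (kept, outliers) for one entry.

    Assumption: if more than MAX_OUTLIERS restraints exceed the cutoff,
    the ones with the largest violations are labelled as outliers.
    """
    candidates = sorted(
        (r for r in restraints if r.max_violation > VIOLATION_CUTOFF),
        key=lambda r: r.max_violation,
        reverse=True,
    )
    outliers = candidates[:MAX_OUTLIERS]
    outlier_ids = {r.restraint_id for r in outliers}
    kept = [r for r in restraints if r.restraint_id not in outlier_ids]
    return kept, outliers
```

With six restraints of which four exceed the cutoff, the function labels the three with the largest violations as outliers and keeps the fourth, which would then still count as a violation in the validation report.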
Cloud computing
The CING calculations require on average 20 min per entry, for a total of about 3000 core hours to process the current set of entries. Most of that time is used to run the many external programs and to prepare the large number of plots that report on the data. Because the complete database needs to be reassembled following each major overhaul of the analysis, this project continues to require substantial computing power. As CING has many external program dependencies, it cannot easily be installed on a traditional grid, but we have found it to be very suitable for a cloud computing setup. The eleven programs required for generating a CING report besides CING (G. Vuister et al., manuscript in preparation) are: CCPN (19), DSSP (20), MatPlotLib (http://matplotlib.sourceforge.net), MOLMOL (21), PROCHECK/Aqua (13), Povray (http://www.povray.org), ShiftX (22), TALOS+ (18), VASCO (16), Wattos (14) and WHAT_CHECK (12). We use the cloud facilities at SARA, our industrial partner Bitbrains (Amstelveen, The Netherlands) and WeNMR/INFN for each full iteration in the NRG-CING project.
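As a rough check on the quoted total, the arithmetic can be sketched as follows. The sketch assumes that each entry occupies a single core for its 20 min, which the text does not state explicitly; the function name `total_core_hours` is hypothetical.

```python
def total_core_hours(n_entries, minutes_per_entry):
    """Core hours for one full pass, assuming one core per entry."""
    return n_entries * minutes_per_entry / 60.0


# 8915 NRG-CING entries (Table 1) at an average of 20 min per entry.
hours = total_core_hours(8915, 20)
print(round(hours))  # 2972, consistent with the quoted total of about 3000
```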
Project management
A large international collaborative project like NRG-CING requires the identification and remediation of issues with the software developed and the procedures used. From the beginning of this project in 2008, the issues were maintained in a Google Code repository at http://code.google.com/p/cing and linked to the source code in the CING project. Together with the general CING issues, almost all of the 300+ issues currently listed have been addressed. The documentation is maintained in Wiki pages at the same site. An automatic build and test farm for several operating systems is managed by Jenkins Continuous Integration (CI, http://jenkins-ci.org) at http://nmr.cmbi.ru.nl/jenkins/job/CING.
RESULTS
NRG-CING database overall composition
Of the 8915 entries contained in the NRG-CING database (September 2011), 5423 contained experimental data including DRs (Tables 1 and 2). These entries span the full time frame during which NMR structures have been deposited (1988 to present). Analysis of the variation in the experimental data also showed that the set contains structures determined both from ‘sparse data’, where only a limited amount of structural information was extracted from NMR experiments, and from abundant experimental data.
Table 2. Statistics of the NRG-CING database

                                      Per entry count
Set                       Entries     Average (SD)    Min.    Max.
Experimental restraints   5519        1392 (1158)     9       11 044
Distances (DRs)           5423        1325 (1107)     11      10 112
AIR DRs only (a)          97          27 (14)         11      49
Dihedral angles (b)       3401        128 (106)       9       1099
RDCs                      426         139 (148)       9       970
Chemical shifts           3626        780 (512)       2       3959
Number of residues        NA          92 (71)         2       1659

(a) The number of HADDOCK AIR entries was overestimated by including every NRG-CING entry with <50 DRs.
(b) The number of entries with dihedral angle restraints is overestimated by including CS derived ones from Talos+.
NA: Not Applicable.
Examples of longitudinal validation
The CS values of the β and γ carbons of proline have been shown to be sensitive to the usual trans or the occasionally occurring cis peptide bond configurations. A study based on 33 cis and 1000 trans Pro residues in non-paramagnetic proteins showed a clear clustering for the 13C β/γ CS difference (CSD) values (23). The regions of (0.0, 4.8) and (9.15, 14.4) ppm corresponded with near absolute certainty to the trans and cis conformations, respectively. In NRG-CING we observe 228 cis and 7949 trans Pro in 3435 entries with β/γ carbon CS values obtained from BMRB. We have identified the reversed correspondence for 8 (cis) and over 100 (trans) occurrences. For example, in the recent Structural Genomics PDB entry 2k8s (Cort J.R. et al., unpublished results), Pro57 in chain B has a CSD of 11.9 ppm, which contradicts the trans state modelled in all conformers of the ensemble. We also observed much more extreme CSD values that are likely caused by human error: e.g. the CSD of Pro71 in PDB entry 2i4k (24) has a very large value (37 ppm) that most likely resulted from uncorrected folding/aliasing of the NMR spectrum.

A second example of the combined analysis of chemical shifts in relation to structural quality concerns the sidechain conformation of the leucine delta carbons. Here too, chemical shifts have proven to be reliable indicators of conformation (25). For the NRG-CING database, 218 (trans) and 115 (gauche+) structured leucine residues in a total of 286 entries showed inconsistencies between the observed chemical shifts and the χ1/χ2 sidechain conformations that warrant further investigation (Berntsen, K.R.M., Doreleijers, J.F., Breukels, V., Stens, E., Vriend, G. and Vuister, G.W., manuscript in preparation).
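The proline check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CING code: the function names are hypothetical, the CSD is assumed to be defined as δ(Cβ) minus δ(Cγ), and the two quoted regions are treated as open intervals. Values outside both regions are left undetermined.

```python
# Regions (ppm) quoted in the text for the 13C beta/gamma shift difference.
TRANS_REGION = (0.0, 4.8)
CIS_REGION = (9.15, 14.4)


def proline_csd(cb_ppm, cg_ppm):
    """Assumed definition: CSD = delta(C-beta) - delta(C-gamma), in ppm."""
    return cb_ppm - cg_ppm


def predict_state(csd):
    """Map a CSD value to 'trans', 'cis' or 'undetermined'."""
    if TRANS_REGION[0] < csd < TRANS_REGION[1]:
        return "trans"
    if CIS_REGION[0] < csd < CIS_REGION[1]:
        return "cis"
    return "undetermined"


def contradicts_model(csd, modelled_state):
    """True if the CSD clearly indicates the opposite state to the model."""
    predicted = predict_state(csd)
    return predicted != "undetermined" and predicted != modelled_state
```

For the CSD of 11.9 ppm quoted for Pro57 of entry 2k8s, `predict_state` returns `"cis"` and `contradicts_model(11.9, "trans")` is true. The 37 ppm value quoted for Pro71 of entry 2i4k falls outside both regions and is returned as undetermined, which would flag it for a separate check of the shift assignment.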
AVAILABILITY
Reports
Currently all wwPDB members (RCSB-PDB, PDBe, PDBj and BMRB) include links to the NRG-CING reports. These pointers drive the vast majority of traffic to the NRG-CING database. The complete NRG-CING database can be accessed by any user. In addition to straightforward selection of specific PDB entries, the front page of the NRG-CING website also allows interactive selection using different criteria, such as protein size, number of distance restraints or chemical shifts, or ROG score. According to Google Analytics, NRG-CING was visited on average ∼25 times per day by 9 ‘absolute unique visitors’ during the last year.
Relational database
In addition to the web-based interactive HTML, CSV dumps from the relational database are available (http://nmr.cmbi.ru.nl/NRG-CING/pgsql). These files can be imported into a slave database using the SQL script at http://tinyurl.com/3rb24eq. The relational database (RDB) contains the validation data at the levels of entry, chain, residue and atom, with special tables recently added for DRs and CSs. Many of the validation criteria in CING are also in this relational database, and plots are available at http://nmr.cmbi.ru.nl/NRG-CING/HTML/plot.html, showing the distribution of values such as detailed in Figure 2 for the CING ROG scores. The NRG-CING RDB is set up in conjunction with the PDBj Mine RDB for full cross-correlated access to PDB meta data such as deposition dates (26).
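As an illustration of how residue-level data from such a CSV dump could be summarised into the per-entry ROG percentages plotted in Figure 2, consider the following Python sketch. The column names (`entry`, `chain`, `residue`, `rog`), the entry identifiers and the sample rows are all hypothetical; the actual table layout is defined by the SQL script mentioned above and is not reproduced here.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical residue-level dump; real column names may differ.
SAMPLE_CSV = """entry,chain,residue,rog
entry_a,A,1,green
entry_a,A,2,green
entry_a,A,3,orange
entry_a,A,4,red
entry_b,A,1,green
entry_b,A,2,red
"""

COLORS = ("red", "orange", "green")


def rog_percentages(csv_text):
    """Return {entry: {color: percentage of residues}} from CSV text."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["entry"]][row["rog"]] += 1
    result = {}
    for entry, per_color in counts.items():
        total = sum(per_color[c] for c in COLORS)
        result[entry] = {c: 100.0 * per_color[c] / total for c in COLORS}
    return result


if __name__ == "__main__":
    for entry, pct in rog_percentages(SAMPLE_CSV).items():
        print(entry, pct)
```

The three percentages per entry add up to 100%, so plotting red against green leaves orange as the implicit third dimension.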
Figure 2.
ROG Results from NRG-CING. The percentage of residues with ROG score red (bad) versus green (good) is plotted with filled circles for 6265 NMR PDB entries from NRG. The red, orange, green (ROG) score is a composite assessment over the validation criteria of the individual programs on the quality of entities such as restraint, coordinate, peak, chemical shift, atom, residue, molecule, etc. The ROG scores are propagated based upon defined relationships between such entities. The entries were selected to have at least 3.5 kDa molecular mass, 10 models and one protein chain. On the bottom right of the banana-shaped distribution is a minority of entries that have a significant fraction of residues marked red. Note that the percentages green and red, taken together with the omitted dimension for orange, add up to 100%.
iCing Server and service
Our multilingual web server (https://nmr.cmbi.ru.nl/icing/) and a web service are together called iCing (see Figure 3). iCing allows a user to submit NMR-derived coordinates, restraints and CS values in three data formats. We preferentially employ CCPN project files (19), but also accommodate additional data formats, such as the outdated plain PDB format for structural data only. Although not preferred, this capability gives the casual user access without requiring sophisticated data preparation. In collaboration with Dr Torsten Herrmann, we added the capability to upload CYANA formatted data, which will make it easier for more standalone programs to integrate with the iCing service.
Figure 3.
iCing Web Server and Service. The screenshot of iCing (Spanish translation selected) shows the customizable definitions for ‘poor’ (orange) and ‘bad’ (red) that CING will use for some WHAT_CHECK parameters. The Google Web Toolkit (GWT) allowed us to easily add German, Spanish, French, Italian, Japanese, Dutch, Portuguese, Russian and Chinese translations to the default English language with help from our colleagues who are native speakers of these languages.
The iCing server can be used prior to a submission or, even better, as part of the iterative process of NMR structure determination. Figure 3 shows that the user can customize the validation criteria, which can be useful to specifically focus attention on particular aspects. Generally speaking, however, this is not recommended because the standard criteria are used in deriving the NRG-CING database. Validation of the validation criteria themselves is a topic of ongoing research.

The server uses a simple three-tier setup with a Google Web Toolkit 2.0 front end, an Apache/Tomcat secured HTTP servlet, and a backend part including the CING installation. The iCing server has seen 1025 unique views during the first 10 months of 2011, according to Google Analytics. The standalone CCPN Analysis program (19) is using iCing as a service extensively. In total, the iCing service has been used for 1417 data sets in the same period.
FUTURE PERSPECTIVES
Improvements
Although NRG-CING is already a valuable resource, as judged from its usage statistics, we continuously seek improvements to the database. We plan to address the following topics: (i) We aim to make the database 100% complete by solving a series of difficult data-related issues (such as Google Code NRG issue 272 and CING issues 266, 310–312) that currently limit us to including only 98.6% of the PDB entries. (ii) We plan to improve the NRG-CING setup with better matches between older BMRB and PDB entries that were deposited before the relationship between these was maintained. (iii) Finally, although RDC data are contained within the database, these should be validated as well.
Usage
Finally, NRG-CING only contains the released PDB entries. This journal, Nucleic Acids Research, like many journals, encourages authors of new structure papers to provide referees with the output from PDB's validation report from http://deposit.pdb.org/validate. It would be of great value to authors and referees to have CING reports available in addition to the currently used validation reports, which cover the coordinates alone.
FUNDING
The Netherlands Organization for Scientific Research (NWO) (grant 700.55.443, to G.W.V.), Netherlands Bioinformatics Centre (NBIC); EU FP6 grants STREP Extend-NMR (LSHG-CT-2005-018988) and EMBRACE (LHSG-CT-2004-512092); FP7 WeNMR (grant 261572, to J.F.D., G.V. and G.W.V.); Brussels Institute for Research and Innovation (Innoviris) (grant BB2B 2010-1-12, to W.V.F.); US National Library of Medicine (grant LM05799, to C.S., E.L.U. and J.L.M.). Funding for open access charge: Radboud University Medical Centre Funds to the CMBI.

Conflict of interest statement. None declared.