Peter A Meyer1, Stephanie Socias1, Jason Key1, Elizabeth Ransey1, Emily C Tjon1, Alejandro Buschiazzo2,3, Ming Lei4, Chris Botka5, James Withrow6, David Neau6, Kanagalaghatta Rajashankar6, Karen S Anderson7, Richard H Baxter8, Stephen C Blacklow1, Titus J Boggon7, Alexandre M J J Bonvin9, Dominika Borek10, Tom J Brett11, Amedeo Caflisch12, Chung-I Chang13, Walter J Chazin14, Kevin D Corbett15,16, Michael S Cosgrove17, Sean Crosson18, Sirano Dhe-Paganon19, Enrico Di Cera20, Catherine L Drennan21, Michael J Eck1,19, Brandt F Eichman22, Qing R Fan23, Adrian R Ferré-D'Amaré24, J Christopher Fromme25, K Christopher Garcia26,27,28, Rachelle Gaudet29, Peng Gong30, Stephen C Harrison1,31,32, Ekaterina E Heldwein33, Zongchao Jia34, Robert J Keenan18, Andrew C Kruse1, Marc Kvansakul35, Jason S McLellan36, Yorgo Modis37, Yunsun Nam38, Zbyszek Otwinowski10, Emil F Pai39,40, Pedro José Barbosa Pereira41, Carlo Petosa42, C S Raman43, Tom A Rapoport44, Antonina Roll-Mecak45,46, Michael K Rosen47, Gabby Rudenko48, Joseph Schlessinger49, Thomas U Schwartz50, Yousif Shamoo51, Holger Sondermann52, Yizhi J Tao51, Niraj H Tolia53, Oleg V Tsodikov54, Kenneth D Westover55, Hao Wu1,56, Ian Foster57, James S Fraser58, Filipe R N C Maia59,60, Tamir Gonen61, Tom Kirchhausen62,63, Kay Diederichs64, Mercè Crosas65, Piotr Sliz1.
Abstract
Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.
Year: 2016 PMID: 26947396 PMCID: PMC4786681 DOI: 10.1038/ncomms10882
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Data science standards.
| Disclosure | Software tools developed under this program will be incorporated into open source software and released to the community. Manuscripts and white papers describing various phases of the project will be released on a regular basis. |
| Adoption | All biomedical image data will be converted to master formats, such as OME-TIFF or HDF5. Community tools to create, analyse, and manipulate diffraction images will be extended to include support for these formats. All biomedical data are assigned Digital Object Identifiers through the CDL EZID system, and follow modified DataCite and Dataverse metadata schemas. Associated metadata are registered with the International DOI Foundation, making them virtually permanent and independent of SBGrid and Harvard computing infrastructure. All data sets published through the SBDG will be citable using Force 11 recommendations. |
| Transparency | Files within individual data sets will be deposited in their original format (no archives or encryption allowed). Self-documentation: The majority of diffraction data sets are self-documented and include the basic information required for reprocessing in the images themselves. Additional information will be collected during deposition and will include data set representation (the ability to use the data to be processed), reference (relation to PDB files, publications, and other data sets), context (for example, a native data set or a derivative used for phasing), fixity (checksums), and provenance (typically the data collection facility and the project member who deposits the original data set). With conversion to master formats, all secondary information will be appended to the image metadata along with all original headers. |
| External dependencies | The ability to reprocess some older data sets and verify master format conversions could depend on access to a specific version of data processing software. As data sets enter our repository, they will be reprocessed with our Data Reprocessing Pipeline (one of several we will develop as part of our Data Mining Pipelines). Data Reprocessing Pipelines will be archived within our system, issued DOIs, and interlinked with the data sets. It is worth noting that, since 2002, SBGrid has been archiving structural biology applications and, therefore, has access to previous software versions that might be required to reprocess older data sets. |
| Licensing | Biomedical data sets will be deposited under the Creative Commons Zero licence, supporting future development of data validation services and database replications and migrations. |
| Technical protection mechanism | The security of the deposited data will be maintained by the DAA. The DAA will join with the Library of Congress sponsored NDSA and the data architect working on the project will ensure that NDSA recommendations are being followed. |
DAA, Data Access Alliance; NDSA, National Digital Stewardship Alliance; SBDG, Structural Biology Data Grid.
Figure 1. Data collection statistics for the pilot subset of 112 data sets.
(a,b) Data sets were collected from synchrotrons on four continents (in addition to laboratory sources, which are not broken down geographically) and originate from eleven synchrotron facilities: Advanced Light Source, Advanced Photon Source, Australian Synchrotron, Cornell High Energy Synchrotron Source, Canadian Light Source, European Synchrotron Radiation Facility, National Synchrotron Light Source, National Synchrotron Radiation Research Center, Swiss Light Source, Shanghai Synchrotron Radiation Facility, and Stanford Synchrotron Radiation Lightsource. World map image courtesy of the U.S. Geological Survey. (c) Breakdown of data sets collected at the Advanced Photon Source beamlines. (d) Data sets cover a range of detector types, including Area Detector Systems Corporation M300, Q210 and Q315, Rayonix MarMosaic, Dectris Pilatus 2M and 6M, R-AXIS HTC, and MAR345.
Figure 2. Estimation of storage requirements for different stages of the structural biology pipeline, based on the SBDG pilot collection.
For structure factor amplitudes and PDB models, file sizes were obtained from a subset of 96 PDB depositions derived from the pilot data sets. On average, SBDG stores 1.26 data sets per PDB file. Numbers in red indicate the estimated storage requirements to accommodate data sets for 100,000 structures. We estimate that for each primary data set, an additional 100 data sets are collected at national facilities. Primary data refers to the original experimental diffraction images supporting the derived structural model, as distinguished from all experimental data (screening images, inferior quality data sets, and so on). For crystallographic experiments, reduced data refers to the integrated intensities (or amplitudes, which do not materially affect storage requirements).
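The scaling behind such an estimate can be sketched in a few lines. The two ratios below (1.26 data sets per PDB file and 100 additional data sets per primary data set) come from the caption; the mean data set size is not stated in this text, so `mean_dataset_gb` is a hypothetical input, and the sketch further assumes that the additional data sets have the same mean size as primary ones.

```python
# Ratios taken from the figure caption.
DATASETS_PER_PDB_FILE = 1.26      # primary data sets stored per PDB file (pilot average)
ADDITIONAL_PER_PRIMARY = 100      # additional data sets collected per primary data set


def estimate_storage_tb(n_structures, mean_dataset_gb):
    """Scale a per-data-set size up to a whole archive.

    mean_dataset_gb is a hypothetical input (not given in the text).
    Additional data sets are assumed to match primary ones in size.
    """
    primary_sets = n_structures * DATASETS_PER_PDB_FILE
    primary_tb = primary_sets * mean_dataset_gb / 1000.0
    # All experimental data = primary data plus the additional data sets.
    all_data_tb = primary_tb * (1 + ADDITIONAL_PER_PRIMARY)
    return {
        "primary_data_sets": primary_sets,
        "primary_tb": primary_tb,
        "all_data_tb": all_data_tb,
    }
```

Calling `estimate_storage_tb(100_000, mean_dataset_gb=...)` with a measured mean size gives the primary-data and all-data totals for the 100,000-structure scenario described above.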
Figure 3. Organized display of data collections at SBDG.
(a) Graphical view of Laboratory and Institutional Collections within the SBDG; (b) PV structure viewer, displaying a published model with links to its two primary deposited data sets.
Figure 4. SBDG persistent data set landing page (the target of a DOI resolver for a published data set).
Data set metadata are displayed, as are instructions for downloading and verifying the data set.
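Verifying a downloaded data set against its fixity information amounts to recomputing checksums and comparing them with the published values. The following is a minimal sketch, not the SBDG procedure itself: the hash algorithm (SHA-256 here), the manifest layout (a mapping of relative path to hex digest), and the function names are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest, algorithm="sha256"):
    """Check files under `root` against a {relative_path: hex_digest} mapping.

    The manifest layout is hypothetical. Returns the relative paths that
    are missing or whose digest differs, in manifest order.
    """
    failures = []
    for relative_path, expected in manifest.items():
        target = Path(root) / relative_path
        if not target.is_file() or file_digest(target, algorithm) != expected.lower():
            failures.append(relative_path)
    return failures
```

An empty list from `verify_manifest` means every listed file is present and matches its recorded digest.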
Figure 5. Experimental data flow and publication.
(a) Flow of Primary Experimental Data. Data sets collected at synchrotrons are moved to end-users' computers for processing and structure determination. Subsequently, refined macromolecular models are deposited at the PDB and primary data are uploaded to SBDG. From SBDG, data sets are replicated to DAA centres and eventually copied to DAA Satellites. End-users can access data sets by downloading from DAA centres and by direct access from Satellites. (b) Flowchart for data publication.
Figure 6. DataCite metadata schema used for primary data sets within the SBDG.
Information associated with the DOI record for a primary data set through the EZID system.
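As an illustration of how a data set DOI and its metadata record might be handled programmatically, the sketch below turns an SBDG-style DOI into a resolver URL and checks a record for a small set of core citation fields. The DOI pattern follows the identifiers listed in this document (`10.15785/SBGRID/<number>`) and `https://doi.org/` is the standard DOI resolver; the dictionary keys in `REQUIRED_FIELDS` are hypothetical names chosen for this example and are not the exact property names of the DataCite schema.

```python
import re

# Pattern inferred from the DOIs listed in this document.
SBDG_DOI = re.compile(r"^10\.15785/SBGRID/(\d+)$")

# Hypothetical field names standing in for core citation metadata.
REQUIRED_FIELDS = ("identifier", "creators", "title", "publisher", "publication_year")


def doi_url(doi):
    """Return the resolver URL for an SBDG-style DOI, or raise ValueError."""
    if not SBDG_DOI.match(doi):
        raise ValueError(f"not an SBDG-style DOI: {doi!r}")
    return "https://doi.org/" + doi


def missing_fields(record):
    """List the required fields that are absent or empty in a metadata record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]
```

Rejecting malformed identifiers early (for example, a DOI with a dot where the slash belongs) keeps citation links from silently pointing nowhere.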
Figure 7. Data publication guidelines.
(a) Flowchart illustrating publication guidelines incorporating software and data citations. (b) Data citation guidelines, adapted from the Dataverse Best Practices Guidelines, which were developed based on the Force 11 Joint Declaration of Data Citation Principles.
Figure 8. Reprocessing of X-ray diffraction data sets.
(a) Analysis of 110 X-ray diffraction data sets that supported previously published PDB coordinates. Most of the failures (represented in red) were due to inaccurate or incomplete image-header information. In several of these cases, depositors provided annotations correcting this information. (b) Comparison of the resolution determined by automated xia2 reprocessing with the published resolution, including data sets not used for final refinement of published structures. (c) Shift in direct beam position between the value in the image headers and the refined value following successful reprocessing with xia2.
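The two quantities compared in panels (b) and (c) are simple to compute once the header and reprocessed values are in hand. The sketch below is illustrative only: the function names are hypothetical, the beam positions are assumed to be (x, y) detector coordinates in millimetres, and nothing here reflects how xia2 itself reports these values.

```python
import math


def beam_shift_mm(header_xy, refined_xy):
    """Euclidean distance between the header and refined direct beam positions.

    Both arguments are (x, y) pairs in the same detector frame; millimetres
    are assumed here.
    """
    return math.hypot(refined_xy[0] - header_xy[0], refined_xy[1] - header_xy[1])


def resolution_gain(published_angstrom, reprocessed_angstrom):
    """Difference in resolution limit, in angstroms.

    Positive when reprocessing reaches a smaller d-spacing (that is, a
    higher resolution) than the published value.
    """
    return published_angstrom - reprocessed_angstrom
```

A large beam shift for a data set that initially failed to index is consistent with the header problems noted in panel (a).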
Reference subset.
| 10.15785/SBGRID/5; Boggon Laboratory; Reference Case 1: MR/Multi-crystal averaging. | Data sets from 5 crystals of SNX17 FERM domain in complex with a peptide corresponding to KRIT1's NPxY2 motif. Integrating the data sets separately and scaling them together yields a complete 3.0 Å data set for molecular replacement (the original paper used 4GXB as a search model) and structure refinement. |
| 10.15785/SBGRID/117; Baxter Laboratory; Reference Case 2: MR/Low resolution, twinned with rotational pseudosymmetry. | 3.70 Å data set collected on a crystal of thioester-containing protein 1 *S1 allele (TEP1*S1). Initial data processing suggested P43212, but one of the two molecules (∼1,300 aa each) in the ASU overlapped with its symmetry-mate. Comparison of alternative scenarios in refinement identified the true space group as |
| 10.15785/SBGRID/62; Modis Laboratory; Reference Case 3: U SAD/Low resolution. | 4.5 Å data set of a uranyl acetate derivative used for a challenging structure determination by SAD. Certain images had streaky features and were excluded from data reprocessing. The height and definition of peaks in anomalous difference Patterson maps were improved by omitting certain images near the end of the data collection run. |
| 10.15785/SBGRID/111; Ferré-D'Amaré Laboratory; Reference Case 4: Ba/K SAD; 91 nt RNA-chromophore complex. | 2.5 Å data set collected at ALS BL 5.0.2 using 6.0 keV X-rays from a crystal of 'Spinach', a fluorescent RNA analogue of GFP. Although the anomalous signal was very weak, a heavy atom substructure composed of one barium and six potassium ions resulted in good quality SAD electron density maps. |
| 10.15785/SBGRID/3; Sliz Laboratory; Reference Case 5: Zn SAD; 4 Zn/ASU; protein/RNA complex. | A 2.9 Å Zn SAD data set was sufficient to determine a crystal structure of the Lin28/let-7d protein-microRNA complex. X-ray beam size was adjusted to maximize flux and minimize radiation damage. One swapped-dimer is located in each asymmetric unit. Two native zinc atoms are located in each tandem CCHC zinc knuckles domain. |
| 10.15785/SBGRID/123; Heldwein Laboratory; Reference Case 6: 3.29-Å SeMet SAD; 9 Se/ASU | This 3.29-Å selenomethionine SAD data set, collected at 0.9789 Å wavelength at the BNL X25 beamline, was sufficient to determine the phases and to trace the structure of the HSV-2 gH/gL complex. |
| 10.15785/SBGRID/179; Schwartz Laboratory; Reference Case 7: MR-SAD at 7.0 Å | Contaminating |
| 10.15785/SBGRID/218, 10.15785/SBGRID/78; Rudenko Laboratory; Reference Case 8: MR-SAD at 2.65 Å (44 Se atoms/ASU) | A 3.25 Å data set (#218) from a crystal of the selenomethionyl neurexin 1alpha ectodomain and a higher-resolution 2.65 Å native data set (#78), both collected at APS using multiple settings. The structure has 2 molecules/ASU with a total of 14 ordered domains and ∼2,000 residues. Molecular replacement successfully placed 8 LNS domains (using a single LNS domain as a search model, i.e., ∼9% of the scattering mass), generating phases that could be used to reveal 37 out of 44 Se atoms/ASU in the 3.25 Å SeMet SAD data set. Refinement was completed using data set #78. |
| 10.15785/SBGRID/9; Tao Laboratory; Reference Case 9: 3.25 Å data set used for MR with a 9-Å cryo-EM envelope | A 3.25-Å resolution data set was collected at APS LS-CAT. The structure was determined by molecular replacement using a 9-Å resolution cryo-EM reconstruction as a phasing model. Solvent flattening and 15-fold noncrystallographic symmetry averaging were applied during phase extension. |
| 10.15785/SBGRID/83; Drennan Laboratory; Reference Case 10: MR/large unit cell, anisotropic. | Diffraction data from different regions of a crystal of Isobutyryl-coenzyme A mutase fused, a 250 kDa dimeric enzyme. This crystal had a large unit cell ( |
| 10.15785/SBGRID/125; Kruse Laboratory (data collected in Kobilka Laboratory); Reference Case 11: MR, lipidic cubic phase | Diffraction data for lipidic cubic phase crystals of human M2 muscarinic acetylcholine receptor bound to the agonist iperoxo, the allosteric modulator LY2119620, and the conformationally-selective nanobody Nb9-8. |
| 10.15785/SBGRID/68; Fraser Laboratory; Reference Case 12: X-ray diffuse scattering | A 1.2 Å data set collected at SSRL provides a high-resolution standard data set of the enzyme cyclophilin for examining the influence of data collection temperature, comparing with XFEL data, and measuring X-ray diffuse scattering. |
MR, molecular replacement; SAD, Single-wavelength Anomalous Diffraction.
Twelve X-ray diffraction data sets from the SBDG pilot collection were identified as particularly suitable for software testing and teaching activities. In addition, data sets from molecular dynamics, lattice light-sheet microscopy and MicroED represent an invaluable subset.