Sharing Big Data.

Marek Grabowski, Wladek Minor

Abstract

Macromolecular Big Data pose numerous challenges; a number of initiatives that are starting to address them are discussed.

Keywords:  Big Data; data sharing; metadata; open science; reproducibility

Year:  2017        PMID: 28250936      PMCID: PMC5331460          DOI: 10.1107/S2052252516020364

Source DB:  PubMed          Journal:  IUCrJ        ISSN: 2052-2525            Impact factor:   4.769


Experimental reproducibility is the cornerstone of scientific research, upon which all progress rests. However, recent systematic surveys have revealed that a large fraction of representative sets of studies published in biomedical journals cannot be reproduced in another laboratory. This increased focus on reproducibility has likely contributed to the growing rate of retractions among scientific publications (Cokol et al., 2008). In contrast to many other areas of biomedical research, macromolecular crystallography has always been at the forefront of ‘reproducible research’ and ‘open science’, long before these approaches became widely appreciated and practiced. From the outset, crystallographers have embraced two fundamental tenets regarding crystallographic data: preserving the relevant ‘data’ and making them publicly available. Initially, the relevant data (the ‘D’ in the PDB, the Protein Data Bank) were limited to atomic coordinates, and were later supplemented by the ‘header’ containing the metadata describing the parameters of data collection (Berman et al., 2000). Since the 1980s, deposition of structure coordinates into the PDB has been a requirement for the publication of a structure in scientific journals. As of 2006, structural deposits include structure factors, which permit recalculation of electron density maps. Yet the primary, ‘raw’ data of macromolecular crystallography, the sets of X-ray diffraction images used to derive the structure-factor files and atomic coordinates, typically have not been preserved, or if they have been preserved, have not been publicly available. In some cases, these data have been retained in ‘data silos’ reportedly kept by synchrotron facilities, individual crystallographers, or pharmaceutical companies. It is the experience of the authors of this commentary that only a very small fraction of requests for original diffraction images sent directly to authors of structures resulted in access to the data.
Traditionally, several factors have been regarded as very difficult challenges for the creation of public repositories of diffraction data: (1) the sheer size of the data, which is 2–3 orders of magnitude greater than that of structure factors, (2) difficulties in organizing, acquiring, curating and managing the associated metadata, and (3) the deep-rooted tendency of researchers to keep their data private. Progress in storage technologies has almost eliminated the first challenge: the cost of hardware required to accommodate diffraction data has dropped significantly. The other two challenges still remain formidable, as discussed in the paper by Kroon-Batenburg, Helliwell, McMahon and Terwilliger in this issue of IUCrJ (Kroon-Batenburg et al., 2017). The authors are long-term advocates of raw diffraction data preservation and are founding members of the Diffraction Data Working Group (DDWG). The paper presents an overview of several initiatives that have emerged in recent years to create public repositories of diffraction data and the diverse ways in which they approach these challenges. The initiatives surveyed in the paper include dedicated diffraction data repositories such as the SBGrid (Meyer et al., 2016) and the IRRMC (Grabowski et al., 2016); institutional repositories such as the one run by the University of Manchester (Tanley, Schreurs, Kroon-Batenburg & Helliwell, 2012); general-purpose repositories for scientific data such as Zenodo and ResearchGate; and synchrotron repositories such as the Store.Synchrotron (Meyer et al., 2014). Together, these resources now contain about 6600 publicly available diffraction datasets, corresponding to nearly 3600 diffraction experiments which have resulted in a PDB deposition. In addition, there are likely thousands more datasets sequestered as ‘dark data’ in non-public data silos at synchrotron facilities or pharmaceutical companies.
The traditional reluctance of researchers to share their data may best be addressed by funding-agency mandates stipulating that all data supporting publicly funded publications should be made publicly available. The two largest public repositories (Grabowski et al., 2016; Meyer et al., 2016) have used a variety of incentives to attract submissions from the community, amassing significant numbers of diffraction experiments (3118 and 253, respectively). All these data would, however, be largely unusable without essential metadata, such as the identity of the sample and the data collection parameters, which are necessary for optimal data reduction and structure determination. The latter are usually recorded in the headers of diffraction images, in one of the more than 200 formats defined by manufacturers of detectors and some synchrotron beamlines. The problem is that the metadata in the headers are sometimes missing, inconsistent, or just plain wrong. Herein lies the ‘metadata challenge’, common to many branches of science. A number of ongoing efforts are directed toward creating scientific ontologies and standardizing metadata (Ashburner et al., 2000; Musen et al., 2015). As the paper explains, structural biology has been in the vanguard of such efforts. The Crystallographic Information Framework (CIF) defined in the 1990s has created an ontology for structural biology (including the imgCIF dictionary for data collection) and defined a ‘holistic metadata framework for crystallography’. Unfortunately, as the authors note, perhaps with a bit of understatement, ‘not all real-world workflows use CIF as their actual mechanism for capturing data and metadata’. The ultimate goal of an ideal repository is to provide a full description of the diffraction experiment in a form that would allow others to easily perform data reduction and structure re-determination for every new deposit in the PDB.
A re-examination of raw data and/or structure factors may bring a significant improvement in structure quality, as shown by the case of cisplatin–protein complexes discussed in the paper. In particular, a re-processing of the original diffraction data for cisplatin bound to hen egg lysozyme (Tanley, Schreurs, Kroon-Batenburg, Meredith et al., 2012), accessible at the University of Manchester repository, and the subsequent re-refinement by a different group, improved the resolution of the crystal structure from 2.4 to 2.0 Å (Shabalin et al., 2015), and a follow-up by the original group was able to improve the resolution further to 1.7 Å (Tanley et al., 2016). The ‘data debate’ which ensued in this case provides a nice illustration of the benefits of sharing data and trying different interpretations, even though, as the authors emphasize, there is currently a lack of uniform community standards as to what is ‘the best’ interpretation. In this particular case, data sharing gave the same apparent improvement as could have been gained by performing new data collection on a hypothetical new powerful detector on a super-modern beamline and on a new-generation synchrotron.
An often-overlooked Achilles heel of most biomedical databases and repositories is that negative results are often very well hidden or non-existent. The diffraction experiment that results in one protein–inhibitor complex is often the outcome of hundreds of diffraction datasets. The IRRMC has a mechanism that allows the deposition of datasets that did not bring the expected results (for example, an empty active site). For obvious reasons, such datasets not only have to be associated with a detailed description of the protein, but must also state which compounds were introduced and by which method. The negative information from screening results could identify ‘blind alleys’ and significantly speed up drug discovery.
None of the current diffraction data repositories provides much, if any, information about the macromolecular sample. Ideally, one would like to have structural data integrated with data from expression, purification, and crystallization, as well as information about biomedical experiments. Creation of such integrated resources remains a dream that may be a major goal of a BD2K-type program.
References

1.  Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors:  M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal:  Nat Genet       Date:  2000-05       Impact factor: 38.330

2.  Crystallography and chemistry should always go together: a cautionary tale of protein complexes with cisplatin and carboplatin.

Authors:  Ivan Shabalin; Zbigniew Dauter; Mariusz Jaskolski; Wladek Minor; Alexander Wlodawer
Journal:  Acta Crystallogr D Biol Crystallogr       Date:  2015-08-28

3.  Believe it or not: how much can we rely on published data on potential drug targets?

Authors:  Florian Prinz; Thomas Schlange; Khusru Asadullah
Journal:  Nat Rev Drug Discov       Date:  2011-08-31       Impact factor: 84.694

4.  A public database of macromolecular diffraction experiments.

Authors:  Marek Grabowski; Karol M Langner; Marcin Cymborowski; Przemyslaw J Porebski; Piotr Sroka; Heping Zheng; David R Cooper; Matthew D Zimmerman; Marc André Elsliger; Stephen K Burley; Wladek Minor
Journal:  Acta Crystallogr D Struct Biol       Date:  2016-10-28       Impact factor: 7.652

5.  Room-temperature X-ray diffraction studies of cisplatin and carboplatin binding to His15 of HEWL after prolonged chemical exposure.

Authors:  Simon W M Tanley; Antoine M M Schreurs; Loes M J Kroon-Batenburg; John R Helliwell
Journal:  Acta Crystallogr Sect F Struct Biol Cryst Commun       Date:  2012-10-26

6.  Structural studies of the effect that dimethyl sulfoxide (DMSO) has on cisplatin and carboplatin binding to histidine in a protein.

Authors:  Simon W M Tanley; Antoine M M Schreurs; Loes M J Kroon-Batenburg; Joanne Meredith; Richard Prendergast; Danielle Walsh; Patrick Bryant; Colin Levy; John R Helliwell
Journal:  Acta Crystallogr D Biol Crystallogr       Date:  2012-04-17

7.  The center for expanded data annotation and retrieval.

Authors:  Mark A Musen; Carol A Bean; Kei-Hoi Cheung; Michel Dumontier; Kim A Durante; Olivier Gevaert; Alejandra Gonzalez-Beltran; Purvesh Khatri; Steven H Kleinstein; Martin J O'Connor; Yannick Pouliot; Philippe Rocca-Serra; Susanna-Assunta Sansone; Jeffrey A Wiser
Journal:  J Am Med Inform Assoc       Date:  2015-06-25       Impact factor: 4.497

8.  Operation of the Australian Store.Synchrotron for macromolecular crystallography.

Authors:  Grischa R Meyer; David Aragão; Nathan J Mudie; Tom T Caradoc-Davies; Sheena McGowan; Philip J Bertling; David Groenewegen; Stevan M Quenette; Charles S Bond; Ashley M Buckle; Steve Androulakis
Journal:  Acta Crystallogr D Biol Crystallogr       Date:  2014-09-30

9.  Data publication with the structural biology data grid supports live analysis.

Authors:  Peter A Meyer; Stephanie Socias; Jason Key; Elizabeth Ransey; Emily C Tjon; Alejandro Buschiazzo; Ming Lei; Chris Botka; James Withrow; David Neau; Kanagalaghatta Rajashankar; Karen S Anderson; Richard H Baxter; Stephen C Blacklow; Titus J Boggon; Alexandre M J J Bonvin; Dominika Borek; Tom J Brett; Amedeo Caflisch; Chung-I Chang; Walter J Chazin; Kevin D Corbett; Michael S Cosgrove; Sean Crosson; Sirano Dhe-Paganon; Enrico Di Cera; Catherine L Drennan; Michael J Eck; Brandt F Eichman; Qing R Fan; Adrian R Ferré-D'Amaré; J Christopher Fromme; K Christopher Garcia; Rachelle Gaudet; Peng Gong; Stephen C Harrison; Ekaterina E Heldwein; Zongchao Jia; Robert J Keenan; Andrew C Kruse; Marc Kvansakul; Jason S McLellan; Yorgo Modis; Yunsun Nam; Zbyszek Otwinowski; Emil F Pai; Pedro José Barbosa Pereira; Carlo Petosa; C S Raman; Tom A Rapoport; Antonina Roll-Mecak; Michael K Rosen; Gabby Rudenko; Joseph Schlessinger; Thomas U Schwartz; Yousif Shamoo; Holger Sondermann; Yizhi J Tao; Niraj H Tolia; Oleg V Tsodikov; Kenneth D Westover; Hao Wu; Ian Foster; James S Fraser; Filipe R N C Maia; Tamir Gonen; Tom Kirchhausen; Kay Diederichs; Mercè Crosas; Piotr Sliz
Journal:  Nat Commun       Date:  2016-03-07       Impact factor: 14.919
