GigaDB: announcing the GigaScience database.

Tam P Sneddon, Peter Li, Scott C Edmunds.

Abstract

With the launch of GigaScience journal, here we provide insight into the accompanying database GigaDB, which allows the integration of manuscript publication with supporting data and tools. Reinforcing and upholding GigaScience's goals to promote open-data and reproducibility of research, GigaDB also aims to provide a home, when a suitable public repository does not exist, for the supporting data or tools featured in the journal and beyond.

Year: 2012 PMID: 23587345 PMCID: PMC3626507 DOI: 10.1186/2047-217X-1-11

Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524

Background

Internet pioneer Sir Tim Berners-Lee has stated: “Data is a precious thing and will last longer than the systems themselves” [1]. In areas such as genomics, data production is growing at rates potentially faster than the ability to store and process it. Despite this challenge, attempts must still be made to capture and safeguard as much of these precious resources as possible. GigaScience journal aims to maximize data reuse, dissemination, and transparency, so having somewhere to host and curate all of the supporting data and tools surrounding this research is essential, and the GigaScience database, GigaDB (http://gigadb.org), is key to achieving this.

Main text

As can be seen in GigaScience’s first issue, a research article on an epigenomics pipeline [2] has its raw data available in NCBI [SRP005934], and also has this and all the supporting data (totaling 84 GB), such as the epigenomics tracks and the tools created for the pipeline [3], hosted in GigaDB. This dataset is linked and cited in the paper through a citable DOI (Digital Object Identifier), which provides stability and, most importantly, additional discoverability and traceability, because it can be tracked in the same manner as standard journal citations. Because we work in partnership with the British Library and the DataCite consortium (http://datacite.org), these datasets are searchable and harvestable through their central metadata repository. Outside of the environmental sciences, data citation is still quite a new area, and we have worked closely with our publisher BioMed Central to ensure that citation of data follows DCC and DataCite best practice guidelines. To promote the open-data movement, data is also released under the most open CC0 waiver, which cuts legal red tape [4] and maximizes its potential for reuse.

As GigaDB uses BGI’s extensive computing infrastructure, it has also been populated with datasets produced by BGI, many of them released in a citable form pre-publication. Releasing data in this novel manner has had a number of successes to date, particularly in spurring the crowdsourcing of data from the deadly 2011 E. coli O104:H4 outbreak (also discussed in Mike Schatz’s commentary in this launch issue [5]), resulting in what has been termed “open-source genomics” [6]. For more on the background and mechanisms surrounding data citation, please see our recent correspondence [7] in the BMC Research Notes Data Sharing, Standardization and Publication series, which uses as its example the release of the sorghum genome by GigaDB and its publication in Genome Biology last year [8].

GigaDB currently comprises over 30 datasets.
The largest of these is a hepatocellular carcinoma dataset [9], which consists of 15 TB of normal and tumor raw data from 88 individuals. Additional data derived and processed from these same individuals, e.g. transcriptome sequence, can also be added to a DOI rapidly after generation, so users can immediately access the data from this ongoing project in a single, permanent place.

The goal of centralizing data and making results reproducible is exemplified by the mouse methylome dataset [3], in which we provide all data necessary to replicate the published results. This includes the raw FASTQ reads, BAM alignment files, the MeDUSA software package, and the bigWig read-depth files. This and the sorghum study are excellent examples for future data submitters of what can be done not only to comply with but also to go beyond minimal journal data policies. The authors adhered to our standard journal editorial policies for genomics studies, with raw data deposition in one of the three INSDC databases and assemblies in GenBank, and the sorghum study authors additionally deposited processed data in the dbSNP and dbVar databases. The methylome GigaDB page also includes data and associated files that do not have equivalent established repositories. The complementary system of releasing data through both GigaDB and established repositories also has the advantage of making the data available much sooner than the staggered build releases of many of these databases, which can take several months.

The GigaDB website is continuing to evolve, and the next version will be released later this year. Features in this version will include an extensive search interface allowing users to choose datasets and/or files for download or export by dataset type, file format, sample, species, DOI, external accession, etc. Although most published GigaDB datasets are genomic, we can accept any large-scale data, including proteomic, environmental, and imaging data.
Taking such a broad range of data types makes data interoperability an issue, and we have been working with the ISA-Commons community to explore whether GigaDB can capture study and assay metadata, along with relationships between dataset components, and accept submissions in their ISA-Tab format [10]. Our first issue provides an example, with much of the data supporting the epigenomics pipeline paper stored in a more interoperable ISA-compliant manner [3]. Upcoming datasets will include gut metagenomic data and a Drosophila genomics workflow dataset. We would like to be as comprehensive as possible, especially in providing a home for data that is not represented in any of the major public databases or repositories, so we encourage you to contact us if you have a dataset or tools you would like to submit to GigaDB.

Maximizing the reuse of published data involves more than its deposition, along with its metadata, into an open access repository in a standardized format. Results published in scientific articles also have to be reproducible so that, for example, comparisons can be made with analyses of new research data [11]. In future editions of GigaScience, we will be working with authors to make the computational tools and data processing pipelines described in their papers available and, where possible, executable on an informatics platform. We hope that by making both the data and the processes involved in their analysis freely accessible, this novel form of publication will help articles published in our journal to have a much higher impact in the scientific literature and maximize their reuse within the community.

Competing interests

All authors are employees of GigaScience and BGI.

Authors’ contributions

All authors have been working on GigaDB and have contributed to this editorial. All authors read and approved the final manuscript.

7 in total

1. Repeatability of published microarray gene expression analyses.

Authors: John P A Ioannidis; David B Allison; Catherine A Ball; Issa Coulibaly; Xiangqin Cui; Aedín C Culhane; Mario Falchi; Cesare Furlanello; Laurence Game; Giuseppe Jurman; Jon Mangion; Tapan Mehta; Michael Nitzberg; Grier P Page; Enrico Petretto; Vera van Noort
Journal: Nat Genet Date: 2008-01-28 Impact factor: 38.330

2. Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4.

Authors: Holger Rohde; Junjie Qin; Yujun Cui; Dongfang Li; Nicholas J Loman; Moritz Hentschke; Wentong Chen; Fei Pu; Yangqing Peng; Junhua Li; Feng Xi; Shenghui Li; Yin Li; Zhaoxi Zhang; Xianwei Yang; Meiru Zhao; Peng Wang; Yuanlin Guan; Zhong Cen; Xiangna Zhao; Martin Christner; Robin Kobbe; Sebastian Loos; Jun Oh; Liang Yang; Antoine Danchin; George F Gao; Yajun Song; Yingrui Li; Huanming Yang; Jian Wang; Jianguo Xu; Mark J Pallen; Jun Wang; Martin Aepfelbacher; Ruifu Yang
Journal: N Engl J Med Date: 2011-07-27 Impact factor: 91.245

3. Toward interoperable bioscience data.

Authors: Susanna-Assunta Sansone; Philippe Rocca-Serra; Dawn Field; Eamonn Maguire; Chris Taylor; Oliver Hofmann; Hong Fang; Steffen Neumann; Weida Tong; Linda Amaral-Zettler; Kimberly Begley; Tim Booth; Lydie Bougueleret; Gully Burns; Brad Chapman; Tim Clark; Lee-Ann Coleman; Jay Copeland; Sudeshna Das; Antoine de Daruvar; Paula de Matos; Ian Dix; Scott Edmunds; Chris T Evelo; Mark J Forster; Pascale Gaudet; Jack Gilbert; Carole Goble; Julian L Griffin; Daniel Jacob; Jos Kleinjans; Lee Harland; Kenneth Haug; Henning Hermjakob; Shannan J Ho Sui; Alain Laederach; Shaoguang Liang; Stephen Marshall; Annette McGrath; Emily Merrill; Dorothy Reilly; Magali Roux; Caroline E Shamu; Catherine A Shang; Christoph Steinbeck; Anne Trefethen; Bryn Williams-Jones; Katherine Wolstencroft; Ioannis Xenarios; Winston Hide
Journal: Nat Genet Date: 2012-01-27 Impact factor: 38.330

4. Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor).

Authors: Lei-Ying Zheng; Xiao-Sen Guo; Bing He; Lian-Jun Sun; Yao Peng; Shan-Shan Dong; Teng-Fei Liu; Shuye Jiang; Srinivasan Ramachandran; Chun-Ming Liu; Hai-Chun Jing
Journal: Genome Biol Date: 2011-11-21 Impact factor: 13.583

5. Resources for methylome analysis suitable for gene knockout studies of potential epigenome modifiers.

Authors: Gareth A Wilson; Pawandeep Dhami; Andrew Feber; Daniel Cortázar; Yuka Suzuki; Reiner Schulz; Primo Schär; Stephan Beck
Journal: Gigascience Date: 2012-07-12 Impact factor: 6.524

6. The rise of a digital immune system.

Authors: Michael C Schatz; Adam M Phillippy
Journal: Gigascience Date: 2012-07-12 Impact factor: 6.524

7. Adventures in data citation: sorghum genome data exemplifies the new gold standard.

Authors: Scott C Edmunds; Tom J Pollard; Brian Hole; Alexandra T Basford
Journal: BMC Res Notes Date: 2012-07-02
