GSA: Genome Sequence Archive.

Yanqing Wang¹, Fuhai Song², Junwei Zhu¹, Sisi Zhang¹, Yadong Yang², Tingting Chen¹, Bixia Tang³, Lili Dong¹, Nan Ding⁴, Qian Zhang⁴, Zhouxian Bai², Xunong Dong², Huanxin Chen¹, Mingyuan Sun¹, Shuang Zhai¹, Yubin Sun¹, Lei Yu¹, Li Lan¹, Jingfa Xiao⁵, Xiangdong Fang⁶, Hongxing Lei⁷, Zhang Zhang⁸, Wenming Zhao⁹.

Abstract

With the rapid development of sequencing technologies towards higher throughput and lower cost, sequence data are generated at an unprecedentedly explosive rate. To provide an efficient and easy-to-use platform for managing huge sequence data, here we present Genome Sequence Archive (GSA; http://bigd.big.ac.cn/gsa or http://gsa.big.ac.cn), a data repository for archiving raw sequence data. In compliance with data standards and structures of the International Nucleotide Sequence Database Collaboration (INSDC), GSA adopts four data objects (BioProject, BioSample, Experiment, and Run) for data organization, accepts raw sequence reads produced by a variety of sequencing platforms, stores both sequence reads and metadata submitted from all over the world, and makes all these data publicly available to worldwide scientific communities. In the era of big data, GSA is not only an important complement to existing INSDC members by alleviating the increasing burdens of handling sequence data deluge, but also takes the significant responsibility for global big data archive and provides free unrestricted access to all publicly available data in support of research activities throughout the world.

Keywords: Big data; GSA; Genome Sequence Archive; INSDC; Raw sequence data

Year: 2017 PMID: 28387199 PMCID: PMC5339404 DOI: 10.1016/j.gpb.2017.01.001

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

Next-generation sequencing (NGS) technologies have been extensively and routinely applied to a wide range of important issues in life and health sciences, leading to an unprecedented explosion in sequence data. Given the ever higher throughput and lower costs brought by rapid advances in NGS technologies, large-scale sequencing projects for population genomics and precision medicine are ongoing or in the planning stages around the world, e.g., the US Precision Medicine Initiative (PMI) [1], UK10K Project [2], Icelandic Population Genome Project [3], and Dog 10K Project [4]. As a corollary, such a deluge of sequencing data poses great challenges in big data deposition, integration, and translation [5], [6]. Accordingly, it is crucial to store and manage sequencing data in support of integrative in-depth analyses and large-scale data mining. The International Nucleotide Sequence Database Collaboration (INSDC) [7], operated jointly by the DNA Data Bank of Japan (DDBJ) [8], the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) [9], and the National Center for Biotechnology Information (NCBI) [10], provides valuable services for archiving a broad spectrum of sequence data. However, with the exponentially accumulating volume of sequence data, submitting big data to INSDC database resources becomes increasingly daunting and time-consuming, simply because network bandwidth is a formidable bottleneck for big data transfer across countries/regions. This situation is particularly severe in China; in our experience, for instance, submission of ∼1 terabyte (TB) of data to the NCBI Sequence Read Archive (SRA) takes ∼2 weeks based on the 150-Mbps upload bandwidth over a shared international network at the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS). China, with increasing funding support in biomedical research, has become a powerhouse in generating enormous amounts of sequencing data.
Given the huge population and rich biodiversity in China, there is little doubt that data generated from sequencing projects for the Chinese population (e.g., CAS PMI at http://news.xinhuanet.com/english/2016-01/09/c_134993997.htm) and domestically featured species will grow at exponential rates, which poses a formidable challenge and burden to the current practice of data submission and sharing. To address this issue, here we present the Genome Sequence Archive (GSA; http://bigd.big.ac.cn/gsa or http://gsa.big.ac.cn), a data repository for archiving raw sequence data. As a core database resource of the BIG Data Center [11] (http://bigd.big.ac.cn), GSA is built on INSDC data standards and structures and provides data archival services for scientific communities not only in China but also throughout the world. GSA accepts raw sequence reads produced by a variety of sequencing platforms, stores both sequence reads and metadata, and provides free and unrestricted access to all publicly available data for worldwide scientific communities.
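The submission time quoted above (∼1 TB in ∼2 weeks on a 150-Mbps link) can be checked with a short calculation. The sketch below is illustrative only and assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bit/s), which the text does not specify. It compares the ideal transfer time at the nominal rate with the average throughput implied by a two-week transfer.

```python
# Back-of-the-envelope check of the submission time quoted in the text.
# Assumptions (not from the source): 1 TB = 10**12 bytes, 1 Mbps = 10**6 bit/s.

TERABYTE_BITS = 8 * 10**12   # ~1 TB expressed in bits
NOMINAL_MBPS = 150           # upload bandwidth stated in the text
OBSERVED_DAYS = 14           # "~2 weeks" stated in the text


def ideal_transfer_hours(size_bits: int, rate_mbps: float) -> float:
    """Hours needed if the link ran at its full nominal rate."""
    return size_bits / (rate_mbps * 10**6) / 3600


def effective_rate_mbps(size_bits: int, days: float) -> float:
    """Average throughput implied by an observed transfer duration."""
    return size_bits / (days * 86400) / 10**6


ideal_hours = ideal_transfer_hours(TERABYTE_BITS, NOMINAL_MBPS)
effective = effective_rate_mbps(TERABYTE_BITS, OBSERVED_DAYS)

print(f"ideal: {ideal_hours:.1f} h at {NOMINAL_MBPS} Mbps")
print(f"effective: {effective:.1f} Mbps ({effective / NOMINAL_MBPS:.1%} of nominal)")
```

Under these assumptions the nominal rate would move 1 TB in well under a day, so a two-week transfer implies that only a small fraction of the 150 Mbps is actually available to a single submission. This is consistent with the link being a shared international one.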

Implementation

GSA is implemented with Java Server Pages (JSP; a Java programming framework for constructing dynamic web pages), Spring (an application framework and inversion of control container; http://www.springsource.org), Struts (a Model-View-Controller framework for creating Java web applications; http://struts.apache.org), and MyBatis (a persistence framework for database connection and operation; http://www.mybatis.org). GSA adopts MySQL (http://www.mysql.org) as its relational database management system to store metadata. All code is developed using Eclipse (http://www.eclipse.org), an integrated development environment (IDE) that features rapid development of Java-based web applications. To provide stable web services, GSA is hosted on the CentOS 7 operating system with four servers: Apache serving static content, Tomcat serving dynamic content, a MySQL server for database management, and an FTP server for file upload and download.

Database content and usage

Data structure and organization

Designed for compatibility, GSA follows INSDC data standards and structures. All data are organized into four objects, i.e., BioProject, BioSample, Experiment, and Run (Figure 1). “BioProject”, bearing an accession number prefixed with “PRJC” (where C, hereinafter, stands for China), provides an overall description of an individual research initiative, including basic description, organism, data type, submitter, funding information, and publication(s) if available. “BioSample”, bearing an accession number prefixed with “SAMC”, contains descriptive information about the biological materials used in the experiments, including sample types and attributes. “Experiment”, bearing an accession number prefixed with “CRX”, provides a detailed description of the treatments for a specific BioSample, including experiment intention, library method, and sequencing type. “Run”, bearing an accession number prefixed with “CRR”, includes a list of sequence data file(s) related to a specific experiment. Note that “Experiment” and “Run” together constitute the China Read Archive (CRA). Based on these standardized data objects, GSA not only facilitates data submission and deposition, but also enables data sharing and exchange.
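The four data objects and their accession prefixes can be sketched as a small data model. This is a minimal illustration, not GSA's actual schema: the class fields, the way an Experiment links to its BioProject and BioSample, and the example accession numbers (including their digit layout) are all assumptions made for this sketch. Only the four prefixes come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Accession prefixes exactly as given in the text.
PREFIX_TO_OBJECT = {
    "PRJC": "BioProject",
    "SAMC": "BioSample",
    "CRX": "Experiment",
    "CRR": "Run",
}


def object_type(accession: str) -> Optional[str]:
    """Map an accession to its data object by prefix, or None if unknown."""
    for prefix, name in PREFIX_TO_OBJECT.items():
        if accession.startswith(prefix):
            return name
    return None


@dataclass
class Run:
    accession: str                      # "CRR..."
    files: List[str] = field(default_factory=list)


@dataclass
class Experiment:
    accession: str                      # "CRX..."
    bioproject: str                     # assumed link to a "PRJC..." accession
    biosample: str                      # the specific BioSample it describes
    runs: List[Run] = field(default_factory=list)


# Hypothetical accessions; the digit layout is made up for illustration.
experiment = Experiment(
    accession="CRX000001",
    bioproject="PRJCA000001",
    biosample="SAMC000001",
    runs=[Run("CRR000001", ["sample_R1.fastq.gz", "sample_R2.fastq.gz"])],
)

kinds = [
    object_type(acc)
    for acc in (
        experiment.bioproject,
        experiment.biosample,
        experiment.accession,
        experiment.runs[0].accession,
    )
]
print(kinds)
```

Because no prefix is itself a prefix of another, a simple `startswith` check is enough to tell the four object types apart.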

Figure 1

Data model in GSA. Prefixes of accession numbers for data objects, including BioProject, BioSample, Experiment, and Run, are indicated in red. Data objects Experiment and Run constitute China Read Archive.

In addition, GSA features umbrella projects, which provide an organizational structure for a large collaborative project consisting of multiple closely collaborating sub-projects funded by the same grant. GSA is well supported by CAS, which functions as the national scientific think tank and academic governing body. Currently, two umbrella projects from CAS Strategic Priority Research Programs and one CAS Key Research Program make it officially mandatory to submit sequencing data to GSA.

Data archive and statistics

GSA accepts data submissions from all over the world, covers the spectrum of sequence reads generated by a variety of sequencing platforms, and accommodates several commonly used file formats, such as FASTQ, BAM, and VCF. GSA validates all submitted data items to ensure data integrity and increase data reusability. Similar to INSDC members, GSA allows users to set data as either public or controlled, indicating that the data are publicly accessible or placed under controlled access over a given period of time, respectively. Regarding data security, all submitted data have copies stored on physically separate disks. Since its inception in August 2015, GSA has seen a dramatic increase in data submissions in terms of the numbers of BioProjects, BioSamples, Experiments, and Runs, as well as file size (Figure 2). As of December 2016, GSA houses a total of 198 BioProjects, 8674 BioSamples, 9263 Experiments, and 10,745 Runs for more than 80 species, submitted by more than 160 data providers from a total of 39 institutions, and archives more than 200 TB of sequence data.
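The text does not say how submitted files are validated. One common way to check that a file arrived intact is to compare a checksum computed by the submitter with one recomputed after upload. The sketch below assumes MD5 checksums and uses hypothetical function and file names; it illustrates the general technique and is not a description of GSA's actual validation pipeline.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, declared_md5: str) -> bool:
    """True if the file's current checksum matches the declared one."""
    return md5_of_file(path) == declared_md5.strip().lower()


# Demonstration on a throwaway file (hypothetical name and content).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "reads.fastq"
    sample.write_bytes(b"@read1\nACGT\n+\nIIII\n")
    declared = md5_of_file(sample)          # computed by the submitter
    ok_before = is_intact(sample, declared)

    sample.write_bytes(b"@read1\nACGA\n+\nIIII\n")  # simulate corruption
    ok_after = is_intact(sample, declared)

print(ok_before, ok_after)
```

Reading in fixed-size chunks keeps memory use flat, which matters for sequence files that can run to many gigabytes.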

Figure 2

Data statistics of GSA. A. Numbers of BioProjects and BioSamples in GSA. B. Numbers of Experiments and Runs, as well as file size in GSA. All statistics are based on data submissions ranging from December 2015 to December 2016.

Data submission and retrieval

To create a submission, users need to register and log into the GSA system. Submitting data to GSA involves five straightforward steps: BioProject, BioSample, Experiment, Run, and Sequence Files (Figure 3). To simplify the submission procedure as much as possible, GSA is equipped with a user-friendly input wizard for metadata collection. To ease sequence file uploading, GSA provides an FTP server supporting two Internet Protocols (IPv4 and IPv6). In addition, GSA provides user-friendly web interfaces for data query and browsing. Users can search for data of interest by specifying a given BioProject, BioSample, Experiment, or Run ID. Moreover, GSA allows users to conduct an advanced search by entering species name, sequencing type, sequencing platform, disease/phenotype/trait, tissue/cell line, etc. GSA also allows users to browse all publicly available BioProjects, BioSamples, and Experiments.
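The advanced search described above amounts to filtering metadata records on several fields at once. The sketch below shows one way such a filter could work on in-memory records. The field names, the records themselves, and the `advanced_search` function are hypothetical and do not reflect GSA's web interface or database queries.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def advanced_search(records: Iterable[Record], **criteria: str) -> List[Record]:
    """Return records matching every criterion (case-insensitive equality)."""

    def matches(record: Record) -> bool:
        return all(
            record.get(key, "").lower() == value.lower()
            for key, value in criteria.items()
        )

    return [record for record in records if matches(record)]


# Hypothetical Experiment metadata, made up for illustration.
RECORDS: List[Record] = [
    {"accession": "CRX000001", "species": "Homo sapiens",
     "platform": "Illumina HiSeq 2000", "sequencing_type": "RNA-Seq"},
    {"accession": "CRX000002", "species": "Oryza sativa",
     "platform": "Illumina HiSeq 2000", "sequencing_type": "WGS"},
    {"accession": "CRX000003", "species": "Homo sapiens",
     "platform": "PacBio RS II", "sequencing_type": "WGS"},
]

hits = advanced_search(RECORDS, species="homo sapiens", sequencing_type="wgs")
print([hit["accession"] for hit in hits])
```

Each additional keyword argument narrows the result set, mirroring how a user would combine species, sequencing type, and platform in a search form.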

Figure 3

Graphic illustration of data submissions to GSA. Two representative studies are provided here as examples to depict the data objects involved in data submission.

Perspectives and concluding remarks

“With great power comes great responsibility”. Nowadays, China is the second largest economy, playing an increasingly important and influential role in the global economy. Equally, in academia, it is time for us to implement the practice of archiving sequence data for worldwide scientific communities, especially considering the large quantities of sequence data generated in China. Equivalent to INSDC members, GSA is committed to archiving raw sequence data. GSA’s ultimate goal, which is also the expectation of funding agencies, is to provide free archival services for raw sequence data, establish and promote a centralized archival practice in China, play an important role in global sequence data archiving, and support research activities in both academia and industry throughout the world. In addition, there are strong domestic incentives and agreements from academia, industry, and government (over 1000 supporters from more than 380 organizations; http://bigd.big.ac.cn/gdsd) to deposit data into GSA and make GSA a centralized archival resource in China. To sum up, GSA is a data repository for archiving raw sequence data. Designed for compatibility, GSA adopts INSDC data standards and structures, archives both sequence reads and metadata submitted from all over the world, and makes all these data publicly available to worldwide scientific communities. In the era of big data, GSA is not only an important complement to existing INSDC members by alleviating the increasing burden of handling the sequence data deluge, but also takes on significant responsibility for global big data archiving and provides free unrestricted access to all publicly available data in support of research activities throughout the world. In the future, we will not only upgrade the infrastructure of GSA to achieve big data storage, exchange, and sharing, but also develop new functionalities to archive population-based PMI data and a variety of metagenome data.

Authors’ contributions

WZ, ZZ, HL, and XF conceived of the idea and supervised the project. WZ, YW, and BT designed the system architecture. YW, JZ, FS, YY and ZB wrote the source code. QZ, ND, TC and XD tested the system. TC, LD and SSZ conducted data quality control and provided feedback service. HC, MS, YS, SZ, LL and LY constructed and maintained the network and hardware infrastructure. WZ, SSZ, and ZZ drafted the manuscript. ZZ, WZ and JX revised the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors have declared no competing interests.

10 in total

1. A new initiative on precision medicine.

Authors: Francis S Collins; Harold Varmus
Journal: N Engl J Med Date: 2015-01-30 Impact factor: 91.245

2. Large-scale whole-genome sequencing of the Icelandic population.

Authors: Daniel F Gudbjartsson; Hannes Helgason; Sigurjon A Gudjonsson; Florian Zink; Asmundur Oddson; Arnaldur Gylfason; Soren Besenbacher; Gisli Magnusson; Bjarni V Halldorsson; Eirikur Hjartarson; Gunnar Th Sigurdsson; Simon N Stacey; Michael L Frigge; Hilma Holm; Jona Saemundsdottir; Hafdis Th Helgadottir; Hrefna Johannsdottir; Gunnlaugur Sigfusson; Gudmundur Thorgeirsson; Jon Th Sverrisson; Solveig Gretarsdottir; G Bragi Walters; Thorunn Rafnar; Bjarni Thjodleifsson; Einar S Bjornsson; Sigurdur Olafsson; Hildur Thorarinsdottir; Thora Steingrimsdottir; Thora S Gudmundsdottir; Asgeir Theodors; Jon G Jonasson; Asgeir Sigurdsson; Gyda Bjornsdottir; Jon J Jonsson; Olafur Thorarensen; Petur Ludvigsson; Hakon Gudbjartsson; Gudmundur I Eyjolfsson; Olof Sigurdardottir; Isleifur Olafsson; David O Arnar; Olafur Th Magnusson; Augustine Kong; Gisli Masson; Unnur Thorsteinsdottir; Agnar Helgason; Patrick Sulem; Kari Stefansson
Journal: Nat Genet Date: 2015-03-25 Impact factor: 38.330

3. DoGSD: the dog and wolf genome SNP database.

Authors: Bing Bai; Wen-Ming Zhao; Bi-Xia Tang; Yan-Qing Wang; Lu Wang; Zhang Zhang; He-Chuan Yang; Yan-Hu Liu; Jun-Wei Zhu; David M Irwin; Guo-Dong Wang; Ya-Ping Zhang
Journal: Nucleic Acids Res Date: 2014-11-17 Impact factor: 19.160

4. Whole-genome sequence-based analysis of thyroid function.

Authors: Peter N Taylor; Eleonora Porcu; Shelby Chew; Purdey J Campbell; Michela Traglia; Suzanne J Brown; Benjamin H Mullin; Hashem A Shihab; Josine Min; Klaudia Walter; Yasin Memari; Jie Huang; Michael R Barnes; John P Beilby; Pimphen Charoen; Petr Danecek; Frank Dudbridge; Vincenzo Forgetta; Celia Greenwood; Elin Grundberg; Andrew D Johnson; Jennie Hui; Ee M Lim; Shane McCarthy; Dawn Muddyman; Vijay Panicker; John R B Perry; Jordana T Bell; Wei Yuan; Caroline Relton; Tom Gaunt; David Schlessinger; Goncalo Abecasis; Francesco Cucca; Gabriela L Surdulescu; Wolfram Woltersdorf; Eleftheria Zeggini; Hou-Feng Zheng; Daniela Toniolo; Colin M Dayan; Silvia Naitza; John P Walsh; Tim Spector; George Davey Smith; Richard Durbin; J Brent Richards; Serena Sanna; Nicole Soranzo; Nicholas J Timpson; Scott G Wilson
Journal: Nat Commun Date: 2015-03-06 Impact factor: 14.919

5. Precision Medicine: What Challenges Are We Facing?

Authors: Yu Xue; Eric-Wubbo Lameijer; Kai Ye; Kunlin Zhang; Suhua Chang; Xiaoyue Wang; Jianmin Wu; Ge Gao; Fangqing Zhao; Jian Li; Chunsheng Han; Shuhua Xu; Jingfa Xiao; Xuerui Yang; Xiaomin Ying; Xuegong Zhang; Wei-Hua Chen; Yun Liu; Zhang Zhang; Kun Huang; Jun Yu
Journal: Genomics Proteomics Bioinformatics Date: 2016-10-13 Impact factor: 7.691

6. The International Nucleotide Sequence Database Collaboration.

Authors: Guy Cochrane; Ilene Karsch-Mizrachi; Toshihisa Takagi
Journal: Nucleic Acids Res Date: 2015-12-10 Impact factor: 16.971

7. The BIG Data Center: from deposition to integration to translation.

Authors:
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

8. DNA data bank of Japan (DDBJ) progress report.

Authors: Jun Mashima; Yuichi Kodama; Takehide Kosuge; Takatomo Fujisawa; Toshiaki Katayama; Hideki Nagasaki; Yoshihiro Okuda; Eli Kaminuma; Osamu Ogasawara; Kousaku Okubo; Yasukazu Nakamura; Toshihisa Takagi
Journal: Nucleic Acids Res Date: 2015-11-17 Impact factor: 16.971

9. Database resources of the National Center for Biotechnology Information.

Authors:
Journal: Nucleic Acids Res Date: 2015-11-28 Impact factor: 16.971

10. The European Bioinformatics Institute in 2016: Data growth and integration.

Authors: Charles E Cook; Mary Todd Bergman; Robert D Finn; Guy Cochrane; Ewan Birney; Rolf Apweiler
Journal: Nucleic Acids Res Date: 2015-12-15 Impact factor: 16.971
