Tingting Chen, Xu Chen, Sisi Zhang, Junwei Zhu, Bixia Tang, Anke Wang, Lili Dong, Zhewen Zhang, Caixia Yu, Yanling Sun, Lianjiang Chi, Huanxin Chen, Shuang Zhai, Yubin Sun, Li Lan, Xin Zhang, Jingfa Xiao, Yiming Bao, Yanqing Wang, Zhang Zhang, Wenming Zhao.
Abstract
The Genome Sequence Archive (GSA) is a data repository for archiving raw sequence data, which provides data storage and sharing services for worldwide scientific communities. Considering explosive data growth with diverse data types, here we present the GSA family by expanding into a set of resources for raw data archive with different purposes, namely, GSA (https://ngdc.cncb.ac.cn/gsa/), GSA for Human (GSA-Human, https://ngdc.cncb.ac.cn/gsa-human/), and Open Archive for Miscellaneous Data (OMIX, https://ngdc.cncb.ac.cn/omix/). Compared with the 2017 version, GSA has been significantly updated in data model, online functionalities, and web interfaces. GSA-Human, as a new partner of GSA, is a data repository specialized in human genetics-related data with controlled access and security. OMIX, as a critical complement to the two resources mentioned above, is an open archive for miscellaneous data. Together, all these resources form a family of resources dedicated to archiving explosive data with diverse types, accepting data submissions from all over the world, and providing free open access to all publicly available data in support of worldwide research activities.
Keywords: GSA; GSA-Human; Genome Sequence Archive; OMIX
Year: 2021 PMID: 34400360 PMCID: PMC9039563 DOI: 10.1016/j.gpb.2021.08.001
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 6.409
Comparison between GSA in 2017 and the GSA family in 2021
| Feature | GSA (2017) | GSA family (2021) |
|---|---|---|
| Archival resources | GSA | GSA, GSA-Human, OMIX |
| Number of supported sample types* | 7 | 11 |
| Batch submission | NA | Available |
| Data statistics | NA | Available |
| Supported languages | English | English, Chinese |
| Controlled access | NA | Available |
| Data transfer | FTP | FTP, Aspera |
| Number of supported sequencing platforms* | 49 | 66 |
| Number of supported data formats* | 9 | 13 |
| Quality control* | Metadata | Metadata, data |
Note: * More details are available at https://ngdc.cncb.ac.cn/gsa/standards. NA, not available.
Figure 1. Data model of the GSA family. BioProject and BioSample are two independent meta-information databases, acting as an organizational framework to provide centralized access to descriptive metadata about research projects and samples, respectively. GSA-Human is for archiving human genetic data and OMIX is for various types of data (that are unsuitable for GSA/GSA-Human).
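The division of labor described in the Figure 1 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not GSA's actual submission logic: the `Submission` class, its field names, and the `choose_archive` function are hypothetical, and the rule is a simplification that restates the caption (human genetic data goes to GSA-Human, other raw sequence data to GSA, and anything unsuitable for either to OMIX).

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical description of a dataset; not a GSA schema."""
    bioproject: str          # identifier of the linked BioProject record (placeholder value)
    is_raw_sequence: bool    # True for raw sequencing data
    is_human_genetic: bool   # True for human genetics-related data


def choose_archive(sub: Submission) -> str:
    """Pick a GSA family resource under the simplified rule stated above."""
    if sub.is_human_genetic:
        return "GSA-Human"   # controlled-access archive for human genetic data
    if sub.is_raw_sequence:
        return "GSA"         # raw sequence data from other sources
    return "OMIX"            # miscellaneous data unsuitable for GSA or GSA-Human


example = Submission(bioproject="PROJECT-PLACEHOLDER",
                     is_raw_sequence=True,
                     is_human_genetic=False)
print(choose_archive(example))
```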
Figure 2. Data statistics of the GSA family. A. Number of runs accumulated from 2016 to 2021, with five major species indicated. B. Increase in the volume of submitted data over time. Time needed to accumulate each PB of data is indicated. All statistics were derived from GSA and GSA-Human as of 30 June 2021. PB, petabyte; d, days.
Data items of the GSA family
| Data item | GSA | GSA-Human | OMIX | Total |
|---|---|---|---|---|
| No. of projects | 3157 | 764 | 100 | 3321 |
| No. of individuals | / | 68,241 | / | 68,241 |
| No. of samples | 191,754 | 155,490 | / | 347,244 |
| No. of experiments | 183,441 | 175,576 | / | 359,017 |
| No. of runs | 201,583 | 194,394 | / | 395,977 |
| File size (terabyte) | 3704 | 5052 | 1.614 | 8757 |
| No. of registered users | 2438 | 2563 | 120 | 4365 |
Note: All statistics were derived from the GSA family as of 30 June 2021. / means not applicable.
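A quick arithmetic check helps in reading the table. The sketch below assumes the four value columns are GSA, GSA-Human, OMIX, and the family total, in that order; the variable names are illustrative. It tests which reported totals equal the sum of the per-resource counts. File size is left out because its reported total is rounded.

```python
# Per-resource counts and reported totals copied from the table.
# Cells marked "/" (not applicable) are simply omitted from the tuple.
table = {
    "projects":    ((3157, 764, 100), 3321),
    "individuals": ((68_241,), 68_241),
    "samples":     ((191_754, 155_490), 347_244),
    "experiments": ((183_441, 175_576), 359_017),
    "runs":        ((201_583, 194_394), 395_977),
    "users":       ((2438, 2563, 120), 4365),
}

# True where the reported total is exactly the sum of the columns.
checks = {name: sum(parts) == total for name, (parts, total) in table.items()}

for name, ok in checks.items():
    print(f"{name}: {'sum of columns' if ok else 'not a plain sum'}")
```

Samples, experiments, and runs add up exactly. The project and user totals are smaller than the column sums, which would be consistent with some projects and user accounts being shared across resources, although the table note does not say so.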