Xueqin Guo1, Fengzhen Chen1, Fei Gao1, Ling Li1, Ke Liu1, Lijin You1, Cong Hua1, Fan Yang1, Wanliang Liu1, Chunhua Peng1, Lina Wang1, Xiaoxia Yang1, Feiyu Zhou1, Jiawei Tong1, Jia Cai1, Zhiyong Li1, Bo Wan1, Lei Zhang1, Tao Yang1, Minwen Zhang1, Linlin Yang1, Yawen Yang1, Wenjun Zeng1, Bo Wang1, Xiaofeng Wei1, Xun Xu1,2,3.
Abstract
As high-throughput sequencing technology is applied and developed across the life and health sciences, the resulting mass of multi-omics data raises the problem of efficient management and utilization. Database development and biocuration are prerequisites for the reuse of these big data. Here, relying on the China National GeneBank (CNGB), we present the CNGB Sequence Archive (CNSA) for archiving omics data, including raw sequencing data and its further analyzed results, which are currently organized into six objects: Project, Sample, Experiment, Run, Assembly and Variation. Moreover, for some projects CNSA has created a correlation model linking living samples, sample information and analytical data. Both living samples and analytical data are directly correlated with the sample information. From any one of the three, information or data for the other two can be obtained, so that all data can be traced throughout the life cycle from the living sample to the sample information to the analytical data. Complying with the data standards commonly used in the life sciences, CNSA is committed to building a comprehensive and curated data repository for storing, managing and sharing omics data. We will continue to improve the data standards and provide free access to open-data resources for worldwide scientific communities to support academic research and the bio-industry. Database URL: https://db.cngb.org/cnsa/
Year: 2020 PMID: 32705130 PMCID: PMC7377928 DOI: 10.1093/database/baaa055
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Definitions and main fields of data objects
| Data object | Definition | Main fields |
|---|---|---|
| Project | An overall description of a single research initiative | Project name, project title, public description, sample scope, data type, submitter, funding information, publication |
| Sample | A description of biological source material | Sample type, sample name, organism, taxonomy ID, collection, isolate, tissue, location, phenotype, disease |
| Experiment | A description of the sample-specific sequencing library, instrument and sequencing methods | File type, sequencing platform, library strategy, library source, library layout |
| Run | A description of the sequencing data files that belong to the related experiment | File name, MD5 value |
| Assembly | A collection of genomic sequences that are used to represent the genome of an organism | Molecule type, coverage, sequencing technology, assembly method |
| Variation | Genome variations of any species | Variation type, position, variation, detection method, clinical significance, phenotype, condition |
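
The relationships among the first four objects in the table can be sketched as plain Python dataclasses. This is a minimal illustration under stated assumptions, not CNSA's actual schema: the class names mirror the data objects, but the attribute names, the choice of fields, the nesting of Samples under a Project, and the `md5_of_bytes` helper are all hypothetical. The only behaviour taken from the table is that a Run records a file name together with an MD5 value, which makes it possible to check a file against its record.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string (hypothetical helper)."""
    return hashlib.md5(data).hexdigest()


@dataclass
class Run:
    # Main fields listed for Run in the table: file name and MD5 value.
    file_name: str
    md5: str

    def matches(self, data: bytes) -> bool:
        """True if the given file content has the recorded MD5 value."""
        return md5_of_bytes(data) == self.md5


@dataclass
class Experiment:
    # A subset of the Experiment fields; attribute names are assumptions.
    sequencing_platform: str
    library_strategy: str
    library_layout: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Sample:
    # A subset of the Sample fields; attribute names are assumptions.
    sample_name: str
    organism: str
    taxonomy_id: int
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class Project:
    # Nesting Samples directly under a Project is an assumption of this sketch.
    project_name: str
    data_type: str
    samples: List[Sample] = field(default_factory=list)

    def all_runs(self) -> List[Run]:
        """Collect every Run reachable from this Project."""
        return [r for s in self.samples for e in s.experiments for r in e.runs]
```

A submission check could then build a `Run` from the declared MD5 value and call `run.matches(file_bytes)` before accepting the file. Assembly and Variation are left out only to keep the sketch short.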
Figure 1. Data model in CNSA. (A) At present, CNSA has six data objects, and the corresponding prefixes of accession numbers are marked in red. (B) Correlation model for the Ruili Botanical Garden project.
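
The traceability property of the correlation model (living samples and analytical data are both linked to the sample information, so any one of the three leads to the other two) can be illustrated with a small lookup index. The sketch below is an assumption-laden toy, not CNSA's implementation: `SampleRecord`, `CorrelationIndex`, the `trace` method and the identifier strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SampleRecord:
    """Sample information plus the identifiers directly correlated with it."""
    sample_id: str
    living_sample_ids: List[str] = field(default_factory=list)
    data_ids: List[str] = field(default_factory=list)


class CorrelationIndex:
    """Resolve a sample, living-sample or data identifier to its sample record."""

    def __init__(self) -> None:
        self._by_sample: Dict[str, SampleRecord] = {}
        # Reverse map: living-sample or data identifier -> sample identifier.
        self._sample_of: Dict[str, str] = {}

    def add(self, record: SampleRecord) -> None:
        self._by_sample[record.sample_id] = record
        for linked_id in record.living_sample_ids + record.data_ids:
            self._sample_of[linked_id] = record.sample_id

    def trace(self, any_id: str) -> SampleRecord:
        """Return the record for any of the three kinds of identifier.

        Raises KeyError if the identifier is unknown.
        """
        sample_id = self._sample_of.get(any_id, any_id)
        return self._by_sample[sample_id]


# Usage with made-up identifiers (not real accession numbers).
index = CorrelationIndex()
index.add(SampleRecord("SAMPLE-1", ["LIVING-1"], ["DATA-1", "DATA-2"]))
record = index.trace("DATA-2")
print(record.sample_id, record.living_sample_ids, record.data_ids)
```

Because both kinds of link hang off the sample record, a single reverse map is enough: starting from a living sample or from a data file, one hop reaches the sample information, and from there the remaining links are available.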
Figure 2. Process of data submission to CNSA.
Figure 3. Data statistics of CNSA. (A) Numbers of Projects, Samples, Assemblies, Experiments and Runs in CNSA. (B) File sizes of Runs and Assemblies in CNSA. All statistics are based on data submitted from November 2017 to May 2020.
Summary of sequence types and amount of several sequence archive databases
| Database | Sequence types | Amount |
|---|---|---|
| INSDC (a) | Next-generation reads, capillary reads, annotated sequences | 7.2 trillion bases |
| TCGA (b) | Genomic, epigenomic, transcriptomic and proteomic sequence reads for tumor and normal samples | 1.4 petabytes |
| GSA (c) | Raw sequence reads of omics | 2.3 petabytes |
| CNSA (d) | Raw sequence reads of omics and assemblies | 2.6 petabytes |

(a) Based on the data statistics of release 119: https://www.ddbj.nig.ac.jp/stats/release-e.html#data_category
(b) Based on the data statistics of May 24, 2020: https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_category%22%2C%22value%22%3A%5B%22sequencing%20reads%22%5D%7D%7D%5D%7D&searchTableTab=files
(c) Based on the statistics of May 24, 2020: https://bigd.big.ac.cn/gsa/
(d) Based on the statistics of May 24, 2020: https://db.cngb.org/cnsa/statistic/