Genome Warehouse: A Public Repository Housing Genome-scale Data.

Meili Chen¹, Yingke Ma¹, Song Wu², Xinchang Zheng¹, Hongen Kang², Jian Sang², Xingjian Xu², Lili Hao¹, Zhaohua Li², Zheng Gong², Jingfa Xiao², Zhang Zhang², Wenming Zhao², Yiming Bao³.

Abstract

The Genome Warehouse (GWH) is a public repository that houses genome assembly data for a wide range of species and delivers a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GWH accepts both full and partial (chloroplast, mitochondrion, and plasmid) genome sequences at different assembly levels, as well as updates to existing genome assemblies. For each assembly, GWH collects detailed metadata on the biological project, biological sample, and genome assembly, in addition to the genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH applies a uniform, standardized quality control procedure. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. As of May 21, 2021, GWH had received 19,124 direct submissions covering 1108 species and had released 8772 of them. Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://ngdc.cncb.ac.cn/gwh.

Keywords: Genome Warehouse; Genome annotation; Genome sequence; Genome submission; Quality control

Year: 2021 PMID: 34175476 PMCID: PMC9039550 DOI: 10.1016/j.gpb.2021.04.001

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 6.409

Introduction

Genome sequences and annotations are fundamental resources for a wide range of genome-related studies, including analyses of various omics data such as genome [1], transcriptome [2], epigenome [3], [4], and genome variation [5], [6]. China, as one of the most biodiverse countries in the world, harbors more than 10% of the world’s known species [7]. In the past decades, a large number of genomes of featured and important animals and crops in China have been sequenced and assembled [1], [8], [9], [10], [11], most of which were submitted to International Nucleotide Sequence Database Collaboration (INSDC) members [National Center for Biotechnology Information (NCBI), European Bioinformatics Institute (EBI), and DNA Data Bank of Japan (DDBJ)] [12]. As genome assembly data grow rapidly, researchers (those in China, for example) face three obstacles to submitting their data efficiently to INSDC members: the large size of genome data, slow transfer rates caused by limited international network bandwidth, and the language barrier in communicating technical issues. All these call for an additional centralized genomic data repository to complement the INSDC. Here, we report the Genome Warehouse (GWH; https://ngdc.cncb.ac.cn/gwh), a centralized resource housing genome assembly data and delivering a series of genome data services. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn) [13], GWH aims to accept data submissions worldwide and to provide an important resource for genome data quality control, data archive, rapid release, and public sharing (e.g., with INSDC) in support of research activities from all over the world. As of May 21, 2021, GWH had received a total of 19,124 genome submissions (including 51 international submissions), demonstrating its increasingly important role in global genome data management and sharing.

Data model

Designed for compatibility with the INSDC data model, each genome assembly in GWH is linked to BioProject (https://ngdc.cncb.ac.cn/bioproject) and BioSample (https://ngdc.cncb.ac.cn/biosample), which are two fundamental resources for metadata description in CNCB-NGDC. Full or partial (chloroplast, mitochondrion, and plasmid) genome assemblies at different assembly levels (complete, draft in chromosome, scaffold, and contig) are all acceptable, and existing genome assemblies can be updated. Accession numbers are assigned according to the following rules (Fig. 1): 1) each genome assembly has an accession number prefixed with “GWH”, followed by four capital letters and eight zeros (e.g., GWHAAAA00000000); 2) genome sequences have the same accession number format as their corresponding genome assembly, except that the eight digits start from 00000001 and increase in order (e.g., GWHAAAA00000001); 3) genes follow a pattern similar to that of genome sequences, with the letter “G” added between the GWH prefix and the four capital letters, and with six digits at the end instead of eight (e.g., GWHGAAAA000001); 4) transcripts replace the “G” in gene accession numbers with the letter “T” (e.g., GWHTAAAA000001); 5) proteins replace the “G” in gene accession numbers with the letter “P” (e.g., GWHPAAAA000001); 6) if a submission updates an existing submission in GWH, a dot and an incremental version number are appended (e.g., GWHAAAA00000000.1).

Fig. 1

Data model in GWH. Genome assembly accession numbers are represented as, for example, “GWHAAAA00000000”, in which “AAAA” can be replaced by any four other capital English letters representing different genome assemblies. The first genome sequence under the genome assembly is represented as “GWHAAAA00000001”, and other genome sequences under the same genome assembly are represented with the last eight digits increasing in order (“GWHAAAA00000002”, “GWHAAAA00000003”, etc.). For the first gene sequence, transcript sequence, and protein sequence under the genome assembly, the accession numbers are assigned as “GWHGAAAA000001”, “GWHTAAAA000001”, and “GWHPAAAA000001”, respectively, and the last six digits increase in order for other genes, transcripts, and proteins.
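The accession rules above can be expressed as a small parser. The following Python sketch is illustrative only and is not part of GWH: the names `Accession` and `parse_accession` are hypothetical, and it assumes that the version suffix described in rule 6 may follow any accession type, which the text shows only for assemblies.

```python
import re
from typing import NamedTuple, Optional

# Rules 1-2: "GWH" + four capital letters + eight digits (00000000 = assembly).
_SEQUENCE_LIKE = re.compile(r"GWH([A-Z]{4})(\d{8})(?:\.(\d+))?")
# Rules 3-5: "GWH" + G/T/P + four capital letters + six digits.
_FEATURE = re.compile(r"GWH([GTP])([A-Z]{4})(\d{6})(?:\.(\d+))?")
_FEATURE_KINDS = {"G": "gene", "T": "transcript", "P": "protein"}


class Accession(NamedTuple):
    kind: str                # assembly, sequence, gene, transcript, or protein
    assembly_code: str       # the four capital letters
    serial: int              # numeric part (0 for the assembly itself)
    version: Optional[int]   # number after the dot, if present (rule 6)


def parse_accession(text: str) -> Accession:
    """Classify a GWH-style accession string; raise ValueError if malformed."""
    match = _SEQUENCE_LIKE.fullmatch(text)
    if match:
        code, digits, version = match.groups()
        kind = "assembly" if int(digits) == 0 else "sequence"
    else:
        match = _FEATURE.fullmatch(text)
        if match is None:
            raise ValueError(f"not a recognised GWH accession: {text!r}")
        letter, code, digits, version = match.groups()
        kind = _FEATURE_KINDS[letter]
    return Accession(kind, code, int(digits), int(version) if version else None)
```

For example, `parse_accession("GWHGAAAA000001")` is classified as a gene under assembly code `AAAA` with serial 1, while `parse_accession("GWHAAAA00000000.1")` is version 1 of the assembly itself. The differing digit counts (eight versus six) are what keep an assembly code beginning with G, T, or P from being confused with a gene, transcript, or protein accession.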

Database components

GWH is a centralized resource housing genome-scale data, with the aim of archiving high-quality genome sequences and annotation information. GWH is equipped with a series of web services for genome data submission, release, and sharing, organized into three major components: data submission, quality control, and archive and release (Fig. 2).

Fig. 2

Major components in GWH data processing workflow.

Data submission

GWH accepts genome assembly data through an online submission system and also supports offline batch submissions. Users need to register first and then provide a complete description of the submitted genome sequences. Biological project and sample information should be provided through BioProject and BioSample (two fundamental resources in CNCB-NGDC), respectively, together with the genome assembly sequence, annotation, and associated metadata. Metadata mainly consists of information about the submitter, general assembly, file(s), sequence assignment, and publication (if available). After submission, GWH runs an automated quality control pipeline to check the validity and consistency of the submitted genome sequence and genome annotation files. Accession numbers are assigned to assemblies and sequences once they pass quality control. Updated assembly data can also be submitted to GWH. It should be noted that, consistent with the practice of INSDC members (e.g., NCBI GenBank), submitters are responsible for ensuring data quality, completeness, and consistency, and GWH does not warrant or assume any legal liability or responsibility for data accuracy.

Quality control

After metadata and file(s) are received, GWH automatically runs a standardized quality control procedure that checks for 45 different types of errors in submitted genome sequences and annotations and, if needed, scans for contaminated genome sequences (see details at https://ngdc.cncb.ac.cn/gwh/documents) (Fig. 2). The procedure roughly falls into five steps. 1) The component checks the consistency of file(s) according to the filename and MD5 checksum. 2) For genome sequences, the component checks the validity of the genome sequence ID and sequence content, e.g., unique sequence ID, sequence composition (A/T/C/G or degenerate base), and sequence length (≥200 bp). 3) For genome annotations, the component checks gene structure completeness and consistency, e.g., unique ID, exon/CDS/UTR coordinates falling within the corresponding gene coordinates, strand consistency for all features (including gene/transcript/exon/CDS/UTR), and codon validity (e.g., valid start/stop codons and no internal stop codon). 4) The component then checks the consistency between genome sequence and annotation. For example, a sequence ID in the genome annotation must match a genome sequence ID, and a feature coordinate must fall within the range of the corresponding genome sequence. 5) Genome sequences are also scanned for vectors, adaptors, primers, and indices (collected from the UniVec database, ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec) using NCBI’s VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen). If an error is found, a report is automatically sent to the submitter by email. To complete a submission, the submitter needs to fix all errors and resubmit the files until they pass the quality control process.
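To make steps 2 and 4 more concrete, the checks they describe can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GWH's actual pipeline: the function names are hypothetical, degenerate bases are taken to be the standard IUPAC nucleotide codes, sequences are compared case-insensitively, and feature coordinates are assumed to be 1-based and inclusive as in GFF3.

```python
from typing import Dict, Iterable, List, Tuple

# A/T/C/G plus the IUPAC degenerate nucleotide codes (assumed interpretation).
IUPAC_BASES = set("ACGTRYSWKMBDHVN")
MIN_LENGTH = 200  # minimum sequence length in bp, as stated in step 2


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (sequence ID, sequence) pairs from FASTA-formatted lines."""
    records: List[Tuple[str, str]] = []
    seq_id = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""  # ID = first token of header
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def check_sequences(records: List[Tuple[str, str]]) -> List[str]:
    """Step 2: unique IDs, allowed bases only, and length >= MIN_LENGTH."""
    errors: List[str] = []
    seen = set()
    for seq_id, seq in records:
        if seq_id in seen:
            errors.append(f"{seq_id}: duplicate sequence ID")
        seen.add(seq_id)
        bad = set(seq.upper()) - IUPAC_BASES
        if bad:
            errors.append(f"{seq_id}: invalid characters {''.join(sorted(bad))}")
        if len(seq) < MIN_LENGTH:
            errors.append(f"{seq_id}: length {len(seq)} below {MIN_LENGTH} bp")
    return errors


def check_features(features: Iterable[Tuple[str, int, int]],
                   lengths: Dict[str, int]) -> List[str]:
    """Step 4: each feature must lie on a known sequence and within its range."""
    errors: List[str] = []
    for seq_id, start, end in features:
        if seq_id not in lengths:
            errors.append(f"{seq_id}: feature refers to unknown sequence")
        elif not (1 <= start <= end <= lengths[seq_id]):
            errors.append(f"{seq_id}:{start}-{end}: coordinates outside sequence")
    return errors
```

Collecting all errors into a list, rather than stopping at the first one, mirrors the reporting behavior described above, in which the submitter receives a report and fixes every error before resubmitting.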

Archive and release

Once a submission passes quality control, GWH assigns a unique accession number to the genome assembly, allots accession numbers to each genome sequence, gene, transcript, and protein, and generates and backs up downloadable files of genome sequence and annotation in FASTA, GFF3, and TSV formats. These files are generated with in-house scripts from the submitted genome sequence and annotation files. To ensure the security of submitted data, a backup copy is stored on a physically separate disk. GWH releases sequence data on a user-specified date, unless a paper citing the sequence or accession number is published before the specified release date, in which case the sequence is released immediately. For released data, GWH generates web pages containing two primary tables: genome and assembly. The former shows species taxonomy information and genome assemblies, and the latter contains general information about the assembly (including external links to other related resources), statistics of the genome assembly, and its corresponding annotation. All released data are publicly available at the GWH FTP site (ftp://download.cncb.ac.cn/gwh). GWH provides data visualization for both genome sequence and genome annotation using JBrowse [14]. It offers statistics and charts on total holdings, assembly levels, genome representations, citing articles, submitting organizations, sequencing platforms, assembly methods, and downloads. GWH provides user-friendly web interfaces for data browsing and querying using BIG Search [13], to help users find any released data of interest. For a released genome assembly, GWH also provides machine-readable application programming interfaces (APIs) for publicly sharing and automatically obtaining information on its associated BioProject, BioSample, genome, and assembly metadata and file paths.
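The release rule described above (release on the user-specified date, or immediately once a citing paper appears earlier) amounts to taking the earlier of two dates. The following Python sketch illustrates that logic only; the function name and its parameters are hypothetical and do not correspond to any documented GWH interface.

```python
from datetime import date
from typing import Optional


def effective_release_date(requested: date, published: Optional[date]) -> date:
    """Return the date on which the data become public under the stated rule.

    requested: the release date chosen by the submitter.
    published: publication date of a paper citing the sequence or accession
               number, or None if no such paper has been published.
    """
    if published is not None and published < requested:
        # A citing paper appeared before the hold expired: release at once.
        return published
    return requested
```

For instance, with a requested release date of 2021-12-31 and a citing paper published on 2021-06-01, the sketch returns 2021-06-01; with no publication, it returns the requested date.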

Global sharing of SARS-CoV-2 and coronavirus genomes

During the COVID-19 outbreak, GWH, in support of the 2019 Novel Coronavirus Resource (2019nCoVR) [15], [16], has received worldwide submissions of more than a thousand SARS-CoV-2 genome assemblies with standardized genome annotations [17] and has released 298 of them. To broaden the international reach of these data, 62 of the released sequences have been shared, with the submitters’ permission, in GenBank [18] through a data exchange mechanism established with NCBI. In this model, GWH accessions are represented as secondary accessions in NCBI GenBank records, which are retrievable through the NCBI Entrez system. This model sets a good example for data sharing among different data centers. In addition, GWH offers sequences of the Coronaviridae family so that researchers can conveniently access the data and study the relationship between SARS-CoV-2 and other coronaviruses. To promote data sharing and make all relevant coronavirus information readily available, GWH integrates genomic and proteomic sequences as well as their metadata from NCBI [19], China National GeneBank Database (CNGBdb) [20], National Microbiology Data Center (NMDC) [21], and CNCB-NGDC. Duplicated records from different sources are identified and removed to obtain a non-redundant dataset. As of May 21, 2021, the dataset contained 163,637 nucleotide sequences and 1,475,933 protein sequences of coronaviruses. Filters allow users to narrow down the coronavirus sequences by multiple conditions, including country/region, host, isolation source, length, and collection date. Both the metadata and sequences of the filtered results can be selected and downloaded as a separate file. The daily updated sequences and all sequences can also be downloaded from FTP (ftp://download.cncb.ac.cn/Genome/Viruses/Coronaviridae).

Data statistics

As of May 21, 2021, GWH had received 19,124 direct submissions covering a broad diversity of species (Table 1) at different assembly levels (Fig. 3). These genome assemblies link to 367 BioProjects and 17,513 BioSamples, and were submitted by 269 submitters from 61 institutions (including 5 international submitters from 2 countries). A total of 8772 submissions have been released, and these were reported in 96 articles from 47 journals. GWH has had over 160,000 visits from 157 countries/regions, with ∼1,600,000 downloads. The amount of data, visits, and downloads in GWH has increased dramatically over the past years, showing its utility in genome-scale data management.

Table 1

Total data holdings in GWH.

Status	Type	Animal	Plant	Fungus	Bacterium	Archaea	Virus	Metagenome	Others	Total
Released	Assembly	531(6.05%)	251(2.86%)	16(0.18%)	291(3.32%)	103(1.17%)	915(10.43%)	6651(75.82%)	14(0.16%)	8772
Released	Species	90(21.28%)	159(37.59%)	14(3.31%)	109(25.77%)	11(2.60%)	23(5.44%)	5(1.18%)	12(2.84%)	423
Unpublic	Assembly	7490(72.35%)	1334(12.89%)	104(1.00%)	76(0.73%)	19(0.18%)	858(8.29%)	10(0.10%)	461(4.45%)	10,352
Unpublic	Species	38(5.31%)	642(89.66%)	7(0.98%)	8(1.12%)	5(0.70%)	4(0.56%)	3(0.42%)	9(1.26%)	716
Total	Assembly	8021(41.94%)	1585(8.29%)	120(0.63%)	367(1.92%)	122(0.64%)	1773(9.27%)	6661(34.83%)	475(2.48%)	19,124
Total	Species	125(11.28%)	786(70.94%)	20(1.81%)	113(10.20%)	13(1.17%)	25(2.26%)	7(0.63%)	19(1.71%)	1108

Note: The numbers of genome assemblies and covering species are those directly submitted to GWH, and their percentages (in parentheses) for different organism groups are presented. GWH, Genome Warehouse.
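As a worked check of how the percentages in Table 1 are derived, each value in parentheses is the group's count divided by the row total. The short Python sketch below recomputes the "Released / Assembly" row from the counts in the table; the variable and function names are illustrative only.

```python
# Counts copied from the "Released / Assembly" row of Table 1.
released_assemblies = {
    "Animal": 531,
    "Plant": 251,
    "Fungus": 16,
    "Bacterium": 291,
    "Archaea": 103,
    "Virus": 915,
    "Metagenome": 6651,
    "Others": 14,
}


def percentage_shares(counts):
    """Return each group's share of the row total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 2) for group, n in counts.items()}


row_total = sum(released_assemblies.values())    # 8772, matching the table
shares = percentage_shares(released_assemblies)  # e.g. Metagenome -> 75.82
```

The recomputed total and shares agree with the table, which confirms that the percentages are taken within each row rather than against the overall 19,124 submissions.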

Fig. 3

Statistics of genome assemblies in GWH (as of May 21, 2021). A. All assemblies. B. Publicly released assemblies. Assemblies at contig, scaffold, chromosome, and complete levels are shown in different colors.

Summary and future directions

Collectively, GWH is a user-friendly portal for genome data submission, release, and sharing, together with a matched series of services. The rapid growth of genome assembly submissions demonstrates the potential of GWH as an important resource for accelerating worldwide genomic research. With the goal of fully realizing the findability, accessibility, interoperability, and reusability (FAIR) of genome data [22], GWH has made ongoing efforts, including but not limited to improving web interfaces for data submission, presentation, and visualization, continuously integrating newly sequenced genomes, and developing useful online tools to help users analyze genome data (such as BLAST [23]). In the future, we will devote more effort to providing genome annotation services, especially for bacterial and archaeal genomes, given that uniform, standardized annotation determines the accuracy of downstream data analysis. In addition, we will expand the coronavirus dataset to other important pathogens to improve the capacity for public health emergency response. Finally, we plan to share and exchange all public genome assembly data with the INSDC members to provide comprehensive data for global researchers.

Data availability

Genome Warehouse is freely accessible at https://ngdc.cncb.ac.cn/gwh.

Competing interests

The authors have declared no competing interests.

CRediT authorship contribution statement

Meili Chen: Methodology, Software, Investigation, Data curation, Writing – original draft, Project administration. Yingke Ma: Software, Writing – original draft. Song Wu: Software, Data curation. Xinchang Zheng: Data curation. Hongen Kang: Software. Jian Sang: Investigation, Data curation. Xingjian Xu: Software. Lili Hao: Investigation. Zhaohua Li: Data curation. Zheng Gong: Data curation. Jingfa Xiao: Writing – review & editing. Zhang Zhang: Writing – review & editing. Wenming Zhao: Writing – review & editing. Yiming Bao: Conceptualization, Writing – review & editing, Supervision.

21 in total

1. The HuangZaoSi Maize Genome Provides Insights into Genomic Variation and Improvement History of Maize.

Authors: Chunhui Li; Wei Song; Yingfeng Luo; Shenghan Gao; Ruyang Zhang; Zi Shi; Xiaqing Wang; Ronghuan Wang; Fengge Wang; Jidong Wang; Yanxin Zhao; Aiguo Su; Shuai Wang; Xin Li; Meijie Luo; Shuaishuai Wang; Yunxia Zhang; Jianrong Ge; Xinyu Tan; Ye Yuan; Xiaochun Bi; Hang He; Jianbing Yan; Yuandong Wang; Songnian Hu; Jiuran Zhao
Journal: Mol Plant Date: 2019-02-23 Impact factor: 13.164

2. CNGBdb: China National GeneBank DataBase.

Authors: Feng Zhen Chen; Li Jin You; Fan Yang; Li Na Wang; Xue Qin Guo; Fei Gao; Cong Hua; Cong Tan; Lin Fang; Ri Qiang Shan; Wen Jun Zeng; Bo Wang; Ren Wang; Xun Xu; Xiao Feng Wei
Journal: Yi Chuan Date: 2020-08-20

3. The 2019 novel coronavirus resource.

Authors: Wen-Ming Zhao; Shu-Hui Song; Mei-Li Chen; Dong Zou; Li-Na Ma; Ying-Ke Ma; Ru-Jiao Li; Li-Li Hao; Cui-Ping Li; Dong-Mei Tian; Bi-Xia Tang; Yan-Qing Wang; Jun-Wei Zhu; Huan-Xin Chen; Zhang Zhang; Yong-Biao Xue; Yi-Ming Bao
Journal: Yi Chuan Date: 2020-02-20

4. GenBank.

Authors: Eric W Sayers; Mark Cavanaugh; Karen Clark; James Ostell; Kim D Pruitt; Ilene Karsch-Mizrachi
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

5. World data centre for microorganisms: an information infrastructure to explore and utilize preserved microbial strains worldwide.

Authors: Linhuan Wu; Qinglan Sun; Philippe Desmeth; Hideaki Sugawara; Zhenghong Xu; Kevin McCluskey; David Smith; Vasilenko Alexander; Nelson Lima; Moriya Ohkuma; Vincent Robert; Yuguang Zhou; Jianhui Li; Guomei Fan; Supawadee Ingsriswang; Svetlana Ozerskaya; Juncai Ma
Journal: Nucleic Acids Res Date: 2016-10-07 Impact factor: 16.971

6. Whole-genome and time-course dual RNA-Seq analyses reveal chronic pathogenicity-related gene dynamics in the ginseng rusty root rot pathogen Ilyonectria robusta.

Authors: Yiming Guan; Meili Chen; Yingying Ma; Zhenglin Du; Na Yuan; Yu Li; Jingfa Xiao; Yayu Zhang
Journal: Sci Rep Date: 2020-01-31 Impact factor: 4.379

7. The international nucleotide sequence database collaboration.

Authors: Masanori Arita; Ilene Karsch-Mizrachi; Guy Cochrane
Journal: Nucleic Acids Res Date: 2020-11-09 Impact factor: 16.971

8. JBrowse: a dynamic web platform for genome visualization and analysis.

Authors: Robert Buels; Eric Yao; Colin M Diesh; Richard D Hayes; Monica Munoz-Torres; Gregg Helt; David M Goodstein; Christine G Elsik; Suzanna E Lewis; Lincoln Stein; Ian H Holmes
Journal: Genome Biol Date: 2016-04-12 Impact factor: 13.583

9. iDog: an integrated resource for domestic dogs and wild canids.

Authors: Bixia Tang; Qing Zhou; Lili Dong; Wulue Li; Xiangquan Zhang; Li Lan; Shuang Zhai; Jingfa Xiao; Zhang Zhang; Yiming Bao; Ya-Ping Zhang; Guo-Dong Wang; Wenming Zhao
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

10. EWAS Data Hub: a resource of DNA methylation array data and metadata.

Authors: Zhuang Xiong; Mengwei Li; Fei Yang; Yingke Ma; Jian Sang; Rujiao Li; Zhaohua Li; Zhang Zhang; Yiming Bao
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971
